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Abstract 


This  thesis  proposes  a  novel  method  for  cooperatively  estimating  the  positions  of  two 
vehicles  in  a  global  reference  frame  based  on  synchronized  image  and  inertial 
information.  The  proposed  technique  -  cooperative  binocular  stereopsis  -  leverages  the 
ability  of  one  vehicle  to  reliably  localize  itself  relative  to  the  other  vehicle  using  image 
data  which  enables  motion  estimation  from  tracking  the  three  dimensional  positions  of 
common  features.  Unlike  popular  simultaneous  localization  and  mapping  (SLAM) 
techniques,  the  method  proposed  in  this  work  does  not  require  that  the  positions  of 
features  be  carried  forward  in  memory.  Instead,  the  optimal  vehicle  motion  over  a  single 
time  interval  is  estimated  from  the  positions  of  common  features  using  a  modified  bundle 
adjustment  algorithm  and  is  used  as  a  measurement  in  a  delayed  state  extended  Kalman 
filter  (EKF).  The  developed  system  achieves  improved  motion  estimation  as  compared  to 
previous  work  and  is  a  potential  alternative  to  map-based  SLAM  algorithms. 
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IMAGE- AIDED  NAVIGATION 


USING  COOPERATIVE  BINOCULAR  STEREOPSIS 


I.  Introduction 


utonomous  navigation  in  indoor  and  urban  environments  is  a  challenging  problem. 


1 .  \.  Arguably  the  most  popular  means  of  localization  in  many  navigation  systems 
involves  the  fusion  of  Global  Postioning  System  (GPS)  signals  and  the  inertial  signals  of 
an  intertial  measurement  unit  (IMU).  This  technique  capitalizes  on  an  IMU’s  ability  to 
capture  quick  motion  and  the  ability  of  GPS  to  constrain  long-term  drift  [20].  This 
approach,  however,  becomes  degraded  or  unreliable  in  environments  where  GPS  signals 
are  either  severely  attenuated  or  denied  -  a  typical  characteristic  of  indoor  and  urban 
environments.  The  goal  of  this  thesis  is  to  develop  a  cooperative  indoor  navigation 
algorithm  for  a  system  of  two  vehicles  which  are  able  to  communicate  image  and 
navigational  data  with  each  other. 

1.1  Background 

With  the  recent  expansions  in  computational  capability  and  the  streamlining  of 
computer  vision  algorithms,  visual  sensors  (e.g.  cameras)  have  become  an  attractive 
substitute  for  GPS  receivers  in  the  sensor-aided  IMU  framework.  Visual  sensors  provide 
number  of  benefits,  such  as  low  cost  and  weight.  These  sensors  are  also  able  to  measure 
distinct  features  in  the  environment,  which  is  useful  for  constructing  a  map  of  the 


environment. 


The  ARDrone  quadrotor  helicopter,  developed  by  Parrot  SA,  is  a  commercially 
available  platform  which  sports  a  number  of  navigational  sensors,  including  a  forward 
looking  camera,  an  optical  flow  sensor,  altitude  sensors,  and  inertial  sensors  [1],  The 
ARDrone  also  comes  equipped  with  an  on-board  processor  which  enables  wireless 
communication,  filters  inertial  sensor  measurements,  and  stabilizes  the  vehicle  while 
airborne.  The  abundant  sensors  and  processor  features,  as  well  as  ease  of  use,  make  the 
ARDrone  a  viable  tool  for  developing  image-aided  navigation  algorithms. 

1.2  Problem 

Previously,  Dean  [18]  developed  a  real-time  image  processing  algorithm  to  extract 
heading  cues  from  the  image  stream  of  the  ARDrone’s  forward  looking  camera.  These 
heading  cues  were  used  to  constrain  rotational  errors  in  the  vehicle’s  filtered  navigation 
estimates  and  correct  directional  errors  in  the  velocities  computed  by  the  vehicle’s 
on-board  image  processing  algorithm.  However,  while  the  heading  error  was  constrained, 
it  was  found  that  the  optical  flow  algorithm  running  on-board  returned  a  poor  estimate  of 
the  vehicle’s  velocity,  which  ultimately  led  to  a  poor  estimate  of  the  vehicle’s  trajectory.  In 
addition,  the  optical  flow  algorithm  required  that  texture  be  added  to  the  flight  path  in 
order  to  return  a  reasonable  estimate,  which  is  not  feasible  for  an  autonomous  system. 

This  research  is  focused  on  developing  a  new  image  processing  algorithm  which 
more  accurately  estimates  the  ARDrone’s  trajectory  without  having  to  alter  the  operating 
environment.  The  developed  algorithm  leverages  additional  sensor  information  from  a 
supporting  ARDrone  operating  within  its  field  of  view  and  processes  the  image  and 
onboard  navigation  information  from  both  vehicles  to  localize  both  vehicles  in  a  global 
reference  frame. 
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1.3  Scope 

The  central  idea  in  this  thesis  is  that  camera  motion  can  be  estimated  from  tracking 
the  collective  motion  of  features  through  a  sequence  of  images.  A  feature  is  a  point  in  the 
environment  that  can  be  distinguished  from  the  area  surrounding  it  -  e.g.  corners  and 
edges.  Feature  tracking  consists  of  determining  corresponding  image  projections  of  the 
same  feature  in  separate  images.  Assuming  that  each  tracked  feature  is  stationary  in  the 
global  reference  frame,  the  change  in  its  projection  onto  the  image  plane  must  be  a  result 
of  camera  motion. 

Feature  localization  in  this  research  is  achieved  through  a  cooperative  adaptation  of 
binocular  stereopsis.  In  traditional  binocular  stereopsis,  a  stereo  pair  of  cameras  is 
mounted  on  a  single  vehicle  and  calibrated  prior  to  flight  to  determine  the  relative 
orientation  and  baseline  between  cameras,  as  well  as  the  individual  camera  calibration 
matrices  [28].  These  parameters  can  be  used  to  determine  the  three  dimensional 
coordinates  of  feature  points  common  to  both  images.  In  this  research,  two  vehicles  are 
flown  in  formation  such  that  a  leading  vehicle  can  be  identified  in  the  corresponding 
image  of  the  trailing  vehicle.  The  relative  translation  and  orientation  for  the  pair  of 
cameras  can  be  re-estimated  for  each  set  of  corresponding  images  -  essentially 
recalibrating  the  stereo  pair  for  each  set  of  images.  Features  can  then  be  tracked  over 
consecutive  image  pairs  to  determine  the  motion  of  each  vehicle. 

The  system  developed  in  this  thesis  will  be  confined  to  operate  indoors.  Specifically, 
the  algorithms  developed  in  this  thesis  and  taken  from  previous  work  [18]  take  advantage 
of  the  structural  properties  of  hallways  to  achieve  localization.  In  addition,  the  lighting 
conditions  and  amount  of  detectable  features  will  be  assumed  not  to  be  an  issue  for 
algorithm  developed  which  performs  feature  tracking. 
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1.4  Related  Work 


Motion  estimation  by  feature  tracking  is  typically  implemented  in  a  technique  called 
simultaneous  localization  and  mapping  (SLAM).  A  SLAM  algorithm  includes  the 
coordinates  of  identified  features  as  random  variables  in  an  estimator.  Over  time, 
measurements  will  reduce  the  uncertainty  in  each  tracked  feature’s  location.  Theoretically, 
as  the  uncertainty  in  a  feature’s  location  is  minimized,  the  feature  will  become  a  landmark 
and  the  vehicle’s  position  uncertainty  will  decrease.  SLAM  was  first  developed  in  [59] 
and  has  been  explored  in  works  such  as  [68],  [14],  [19],  and  [56]. 

While  theoretically  sound,  SLAM  algorithms  in  practice  are  prone  to  inconsistency 
[36]  [5].  Inconsistency  is  defined  as  the  condition  in  which  the  error  covariance  of  the 
vehicle’s  position  estimate  is  underestimated.  In  addition,  the  large  structure  of  the  SLAM 
problem  is  not  typically  feasible  for  small  vehicles  with  limited  payload  for  computing 
power.  Submapping  techniques  are  a  subset  of  the  larger  SLAM  literature  which  seek  to 
reduce  computation  and  error  by  dividing  an  environment  into  maps  populated  by  feature 
points  [50]  [51]  [13]  [22].  In  Pose  SLAM  [32]  [38]  [41],  the  features  are  eliminated 
altogether  and  the  map  consists  of  a  finite  or  infinite  number  of  past  vehicle  positions. 

Alternatively  to  maintaining  all  tracked  features  in  a  SLAM  network,  the  motion  of  a 
feature  over  time  can  be  used  in  a  navigation  estimator  without  including  its  position  as  a 
random  variable.  In  [48],  [7],  and  [20]  the  changing  locations  of  features  across  images 
were  used  to  derive  constraints  on  vehicle  motion.  In  [48]  specifically,  the  difference  in 
the  location  of  a  feature  is  related  to  the  current  vehicle  position  and  a  history  of  previous 
vehicle  positions.  Each  feature  produces  a  constraint  on  the  positions  of  the  vehicle  over 
time,  similar  to  the  algorithm  developed  in  this  thesis.  In  contrast,  the  algorithm  developed 
in  this  thesis  handles  all  tracked  features  simultaneously  to  produce  a  single  measurement 
which  corresponds  to  the  difference  between  each  vehicle’s  current  position  and  its 
position  at  the  previous  time-step. 
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Another  key  feature  of  this  work  is  the  cooperative  aspect.  Cooperative  localization 
has  been  explored  in  a  number  of  different  facets.  In  [61]  and  [19],  a  network  of  vehicles 
navigates  in  an  alternating  fashion  where  vehicles  take  turns  acting  as  landmarks  for  the 
rest  of  the  vehicles  in  the  group  to  navigate.  This  approach  is  generally  not  feasible  for 
airborne  vehicles.  In  [45]  and  [34],  all  of  the  vehicles  move  continuously  and  the  relative 
position  of  each  vehicle  is  inferred  from  features  that  are  common  to  both  of  their 
cameras.  The  system  designed  in  this  thesis  uses  a  combination  of  these  two  approaches. 

1.5  Research  Contributions 

This  research  expands  the  developed  system  model  outlined  in  [18]  to  enable 
estimation  of  the  three  dimensional  position  of  two  vehicles.  This  research  also  contributes 
a  cooperative  localization  algorithm  and  a  flexible  state  estimation  algorithm,  based  on  the 
expanded  system  model,  which  is  able  to  handle  a  varying  number  of  measurements. 

1.6  Research  Outline 

The  research  begins  by  modifying  existing  software  to  support  a  second  vehicle  and 
multiple  vehicle  communications.  With  the  expanded  software  structure  in  place,  the 
cooperative  binocular  stereopsis  algorithm  is  developed  to  capture  and  process  a  stereo 
image  pair  and  return  a  structure  of  image  measurements.  These  measurements  include 
vanishing  points,  as  described  in  [18];  the  estimated  camera  baseline;  and  a  varying 
number  of  three-dimensional  feature  points,  found  through  linear  triangulation.  As  an 
intermediate  step,  the  estimated  baseline  and  feature  points  are  processed  simultaneously 
to  estimate  the  change  in  position  and  attitude  for  each  vehicle  between  two  time  epochs 
using  nonlinear  regression.  The  state  estimation  algorithm  is  then  developed  to  handle  the 
measurements  in  a  delayed  state  Kalman  filter.  A  series  of  experiments  are  then  presented 
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to  validate  the  system  in  stages:  starting  from  validating  the  image  processing  algorithm, 
including  feature  matching,  and  leading  up  to  validating  the  full  Kalman  filtering 
algorithm. 

1.7  Thesis  Outline 

Chapter  2  provides  background  information  on  the  topics  of  vision  aided  navigation, 
computer  vision,  autonomous  navigation,  and  cooperative  vehicle  behavior.  The  purpose 
is  two-fold:  to  introduce  mathematical  concepts  and  conventions  that  are  fundamental  to 
this  thesis  and  to  provide  an  analysis  of  current  and  past  approaches  to  the  localization 
problem.  Chapter  3  develops  the  algorithms,  code  architectures,  and  system  model  used  in 
this  research.  Chapter  4  describes  the  experiments  used  to  test  the  developed  system  and 
presents  their  results.  Chapter  5  contains  conclusions  and  suggestions  for  further  research. 
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II.  Background 


This  chapter  discusses  relevant  topics  in  vision  navigation  as  they  apply 

to  the  research  described  in  this  thesis.  The  chapter  begins  by  introducing  computer 
vision  as  it  applies  to  navigation  using  vision.  Next,  the  area  of  navigation  using  vision  is 
described  -  focusing  on  the  case  of  vision-aiding  for  indoor  navigation.  Techniques  for 
feature  matching  and  outlier  rejection  are  then  introduced.  Next,  the  Kalman  filter 
equations  used  in  this  thesis  are  presented.  Various  localization  mechanisms  are  then 
presented,  including  SLAM,  multi- view  geometric,  and  batch  methods.  The  topic  of 
localization  is  then  extended  to  the  case  of  multiple  cooperative  vehicles  and  a  brief 
discussion  of  autonomous  systems  is  presented.  This  thesis  follows  the  work  presented  in 
[18],  in  which  perspective  features  were  used  to  correct  heading  errors  for  a  single 
navigating  quadrotor.  The  chapter  closes  with  an  overview  of  the  system  designed  in  [18]. 

2.1  Computer  Vision 

A  core  component  of  this  research  is  computer  vision.  Fundamentally,  computer 
vision  is  derived  from  projective  geometry,  where  3D  space  is  captured  on  an  infinite  2D 
plane.  In  computer  vision,  however,  the  dimensions  and  projection  resolution  of  the  2D 
plane  are  limited  by  the  visual  sensor.  An  illustration  of  projection  as  it  applies  to 
computer  vision  is  shown  in  Figure  2.1. 

Computer  vision  seeks  mathematical  analogues  to  perception.  For  instance,  sharp 
edges  in  an  image  correspond  to  high  spatial  frequencies.  Colors  in  an  image  can  be 
represented  by  a  vector  describing  the  combination  of  a  basis  color  set  -  such  as  red, 
green,  and  blue  (RGB).  These  mathematical  interpretations  are  what  enable  artificial 
perception  in  digital  systems. 
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Figure  2.1:  Projective  Geometry.  3-dimensional  points  from  a  scene  are  represented  on 
a  2-dimensional  plane.  The  image  plane  in  computer  vision  is  constrained  by  physical 
dimensions  W  and  H  [66] . 


2.2  Vision-Aided  Navigation 

It  is  important  to  make  a  subtle  but  important  note  on  the  similarities  and  differences 
between  vision-based  and  vision-aided  navigation  systems.  Vision-based  navigation 
systems  seek  to  localize  a  body  based  only  on  visual  data.  Techniques  that  fall  into  the 
vision-based  category  include  structure  from  motion  and  bundle  adjustment  [21]  [33]  [65]. 
An  example  of  a  vision-based  system  is  [54],  where  scene  analysis  is  used  to  extract 
significant  landmarks  and  enable  goal-seeking  navigation.  Vision-based  navigation 
systems  can  be  limited  by  the  computational  cost  of  each  image  processing  algorithm,  the 
accuracy  of  the  imaging  sensors,  and  the  accuracy  of  the  updating  mechanism. 
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In  contrast,  a  vision-aided  navigation  system  synthesizes  visual  data  to  aid  a  dynamic 
model.  A  dynamic  model  provides  an  estimate  of  position  based  on  the  equations  of 
motion.  A  simple  dynamic  model  takes  the  form 
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where  X,Y  are  positions,  Vx,  Vy  are  velocities,  and  6va,6vt,  are  velocity  correction  inputs  to 
the  system.  Visual  data  is  then  used  to  refine  the  model’s  estimate.  Examples  of  vision 
aided  navigation  algorithms  are  found  in  [35],  [34],  and  [42]. 


2.3  The  Indoor  Environment 

It  is  also  important  to  characterize  the  environment  in  which  a  vision-aided  navigation 
system  will  operate.  In  an  indoor  environment,  the  geometry  of  man-made  structures  can 
be  exploited  for  navigation  purposes.  In  the  previous  work  [18],  the  known  orthogonality 
of  the  indoor  environment  was  exploited  to  calculate  the  theoretical  intersection  of  parallel 
lines  -  the  vanishing  point.  An  illustration  of  this  idea  is  shown  in  Figure  2.2. 

The  disadvantage  of  the  indoor  environment,  from  a  feature  tracking  perspective,  is 
that  a  blank  section  of  a  wall  might  only  provide  sparse  features  which  are  difficult  to 
identify  or  match  over  time.  Additionally,  dimly  lit  structures  can  cause  problems  for 
imaging  sensors.  These  problems  will  be  assumed  negligible  in  this  research  by  operating 
in  well-lit  environments. 
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Figure  2.2:  Manipulating  Indoor  Structure  for  Determining  Vanishing  Points.  The 
vanishing  point  algorithm  in  [18]  hinges  on  the  assumption  that  the  intersecting  lines  shown 
are  projections  of  parallel  lines  in  an  indoor  landscape. 


2.4  Relevant  Coordinate  Frames 

The  following  coordinate  frames  are  defined  for  the  reader  as  they  are  used 
throughout  this  document.  These  frames  were  defined  in  [66],  and  are  presented  here 
merely  for  completeness. 

Pixel  Frame:  The  pixel  fxa.me,{xpix  ,ypix)  is  the  image  plane  shown  in  Figure  2.1.  It  is 
a  two-dimensional  frame  in  which  the  projected  coordinates  of  a  3D  point  are  captured. 
The  x  and  y  axes  represent  the  columns  and  rows,  respectively,  expressed  in  pixels.  In  this 
document,  the  origin  of  the  pixel  frame  is  defined  as  the  upper  left-hand  corner  of  the 
pixel  frame,  as  shown  in  Figure  2.3. 

Camera  Frame:  The  camera  frame  {Xc,  Yc,Zcj  is  a  three-dimensional,  orthogonal, 
right-handed  frame  with  its  origin  at  the  center  of  the  camera  lens.  In  this  frame,  the 
positive  z-axis  is  directed  outward  from  the  camera  lens,  the  positive  y-axis  is  directed 
downward,  and  the  positive  x-axis  is  directed  to  the  right.  An  illustration  is  shown  in 
Figure  2.4.  The  camera  frame  is  related  to  the  pixel  frame  by  the  equation 


10 


Figure  2.3:  Pixel  Frame  of  the  ARDrone’s  Forward  Looking  Camera.  Three-dimensional 
points  from  a  scene  are  projected  onto  a  two-dimensional  plane  and  are  given  coordinates 
in  pixels  [18]. 


spix  = 


rr\piX  _ 

c  — 


1 


■  Tp''tsc 


[sc]z 

fx  y  cx 

0  fy  Cy 

0  0  1 


(2.2) 


where  sc  is  a  vector  to  a  target  in  the  camera  frame  and  Tpix  is  the  camera  calibration 
matrix  which  has  the  following  parameters:  the  x  and  y  focal  lengths  fx,  fy;  the  skew  7; 
and  the  camera  center  coordinates  cx,  cy  in  pixels.  These  parameters  are  determined 
through  camera  calibration.  Note  that  [s'j-  denotes  the  z  component  of  sc. 
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Figure  2.4:  Camera  Frame  of  the  ARDrone.  The  camera  frame  is  a  right  handed  frame 
with  the  origin  located  at  the  center  of  the  ARDrone’s  foward  looking  camera. 


Body  Frame:  The  body  frame  {Xb,  Yb ,  Zb }  is  attached  to  the  vehicle  and  is  the  frame 
in  which  measurements  are  resolved.  The  Euler  angles  roll,  pitch,  and  yaw  (<p,  8,  if/)  are 
defined  as  a  sequence  of  rotations  about  the  x,  y,  and  z-axes,  respectively,  of  the  body 
frame.  The  body  frame  is  superimposed  on  the  ARDrone  airframe  in  Figure  2.5.  The  body 
frame  is  related  to  the  camera  frame  by  the  coordinate  transform 


C 


b 

c 


0  0  1 
1  0  0 


0  1  0 


(2.3) 


where  Cb  is  a  direction  cosine  matrix  (DCM)  [63]  which  rotates  points  in  the  camera 
frame  into  the  body  frame. 


Navigation  Frame:  The  navigation  frame  {X”,  Yn,Zn }  is  an  earth  fixed,  local  level 
frame.  The  origins  of  the  body  and  navigation  frame  coincide  when  the  drone  is 
initialized,  but  the  navigation  frame  remains  fixed  as  the  body  frame  rotates  and  translates 
in  flight.  Quantities  will  ultimately  be  expressed  in  the  navigation  frame,  as  this  frame 
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Figure  2.5:  Body  Frame  of  the  ARDrone.  In  this  frame,  the  x-axis  extends  out  from  the 
nose  of  the  vehicle,  the  y-axis  extends  out  the  right  wing,  and  the  z-axis  extends  downward 
[18]. 


provides  meaningful  information  to  the  user.  The  navigation  frame  is  related  to  the  body 
frame  by  the  coordinate  transform  [63] 
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where  Cnb  is  also  a  DCM. 
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(2.4) 


2.5  Feature  Matching 

Feature  matching  is  the  process  by  which  points  in  separate  images  are  determined  to 
belong  to  the  same  three  dimensional  feature.  This  process  consists  of  three  stages: 
extraction,  description,  and  descriptor  matching.  There  are  a  number  of  algorithms  that 
exist  for  all  three  stages  and,  for  the  most  part,  these  algorithms  can  be  combined  in  any 
number  of  ways. 


2.5.1  Feature  Extraction. 


Feature  extraction  is  the  process  by  which  distinguishable  points  or  shapes  in  an 
image  are  identified  according  to  a  specific  criterion.  The  goal  of  feature  extraction  is  to 
sample  an  image  for  keypoints  which  have  a  high  probability  of  being  found  in  another 
image.  Feature  extraction  algorithms  can  be  classified  into  two  main  categories:  local 
intensity  methods  and  spatial  frequency  methods. 

Local  intensity  methods  -  such  as  the  Harris  comer  detector,  Lucas-Kanade  tracker, 
and  others  [58] [25]  -  identify  regions  in  an  image  where  pixel  intensities  change  rapidly 
by  calculating  the  eigenvalues  of  a  matrix  of  image  gradients.  As  image  detail  increases, 
so  do  the  eigenvalues  of  this  matrix  and  therefore  keypoints  can  be  identified  by 
thresholding  the  eigenvalues  [58]. 

Identifying  keypoints  which  have  a  large  change  in  intensity  does  not  guarantee  that 
the  keypoints  will  be  reliably  identified  in  another  image;  a  more  robust  method  is  desired. 
As  an  example,  consider  the  scale  invariant  feature  transform  (SIFT)  algorithm  developed 
by  Lowe  [43].  SIFT  uses  a  difference-of-Gaussian  to  find  stable  keypoints.  An  advantage 
of  this  algorithm  is  its  ability  to  find  keypoints  which  will  be  invariant  to  scale  changes, 
which  makes  this  algorithm  very  popular;  however,  the  algorithm  will  still  fail  under  large 
affine  transforms.  A  typical  image  can  return  nearly  2000  stable  SIFT  keypoints  [43]. 

2.5.2  Feature  Description. 

The  goal  of  feature  description  is  to  describe  a  feature  point  using  the  image  area 
surrounding  it.  This  description  should  be  unique  enough  to  distinguish  it  from  other 
features,  but  generic  enough  to  allow  the  same  feature  to  be  described  in  the  same  way 
under  changing  environment  conditions  (e.g.  scale,  rotation,  and  lighting).  For  example, 
SIFT  has  a  128-element  Euclidean  descriptor  vector  determined  by  the  local  image 
gradient  surrounding  the  feature  point  [43].  The  SURF  algorithm,  a  SIFT  variant,  detects 
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features  based  on  the  Hessian  matrix  and  integral  images,  resulting  in  feature  descriptors 
which  are  more  robust  than  SIFT  and  only  have  64  elements  [6]. 

Recently,  binary  descriptors  have  become  a  more  attractive  option  for  feature 
description  in  real-time  applications.  As  its  name  would  suggest,  a  binary  descriptor  seeks 
to  uniquely  define  a  point  in  an  image  using  a  vector  of  binary  values.  Binary  descriptors 
are  attractive  from  a  computational  perspective,  since  computing  the  distance  between  two 
binary  vectors  simplifies  to  boolean  arithmetic.  An  example  of  a  binary  descriptor 
algorithm  is  the  binary  robust  invariant  scalable  keypoints  (BRISK)  algorithm  [40].  The 
BRISK  descriptor  is  constructed  by  thresholding  the  intensity  of  points  sampled  near  the 
feature  of  interest.  Points  which  exceed  the  threshold  are  given  a  value  of  1,  and  points 
below  the  threshold  are  given  value  0. 

2.5.3  Feature  Descriptor  Matching. 

Feature  descriptor  matching  is  the  final  process  in  feature  matching  which  attempts  to 
find  a  set  of  matching  keypoints  based  on  the  computed  distance  between  descriptors.  For 
Euclidean  descriptors  (e.g.  SIFT)  the  distance  is  the  vector  norm  of  the  difference  of  the 
two  descriptor  vectors.  For  binary  descriptors  (e.g.  BRISK)  the  distance  is  computed  by 
an  XOR  operation  of  each  corresponding  element  of  the  two  descriptor  vectors  followed 
by  a  bit  count  [40].  The  latter  is  much  faster  to  compute  than  a  vector  norm. 

2.6  Model  Fitting  Using  RANSAC 

Consider  the  problem  of  determining  the  outliers  in  a  series  of  ( X ,  Y)  data  points.  A 
naive  approach  would  be  to  determine  the  line  of  best  fit  using  all  data  and  then  eliminate 
the  outliers.  Including  all  data  can  greatly  skew  the  line  of  best  fit.  In  the  RANSAC 
approach  [24],  the  line  of  best  fit  would  be  estimated  from  a  small,  random  subset  of  the 
entire  dataset,  which  reduces  the  impact  of  outliers. 

At  each  iteration,  the  RANSAC  algorithm  determines  a  model  for  the  entire  set  based 
on  a  random  subset  of  data.  The  number  of  data  points  that  agree  with  the  model  -  the 
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inliers  -  are  counted.  If  there  are  enough  inliers,  the  model  is  accepted  and  the  algorithm 
is  finished.  Otherwise,  the  algorithm  iterates  through  another  subset  or  ends  in  failure  if 
the  maximum  number  of  iterations  is  reached.  A  generic  RANSAC  algorithm  is  presented 
in  Figure  2.6. 


1 :  procedure  Ransac 
Require:  a  mapping  M{B  <—  A) 

2:  given  data  sets  A  and  B 

3:  parameters  n  —  subset  size,  r  =  threshold,  i  =  max  iterations 

4:  N  -  set  size  for  consensus 

5:  select  random  set  C  c  A  of  size  n 

6:  repeat 

7:  determine  a  mapping,  M(B  <—  C) 

8:  VaeA,/?efl 

9:  A*  =  {a  B  || M(a)  —  p\\  >  t) 

10:  if  size(A*)  >  N  then 

11:  M(B  <—  A)  =  M() 

12:  done 

13:  else  if  number  of  iterations  <  i  then 

14:  select  new  random  C  c  A  of  size  n 

15:  else 

16:  end  in  failure 

17:  done 

18:  end  if 

19:  until  done 

20:  end  procedure 

Figure  2.6:  Generic  RANSAC  Algorithm.  Rather  than  use  all  data,  RANSAC  attempts 
to  find  a  model  fit  from  a  random  subset  of  data  points.  This  method  is  more  resistant  to 
outliers  than  standard  least  squares  estimation. 
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In  the  context  of  the  feature  matching  scenario  in  the  previous  section,  RANSAC 
proves  to  be  very  useful  for  pruning  out  erroneous  feature  matches.  Feature  extraction  and 
matching  produces  an  initial  pairing  of  points  in  each  view,  of  which  a  handful  of  matches 
may  be  incorrect.  After  defining  a  mathematical  mapping  between  the  pairs  of  points 
(such  as  the  fundamental  matrix),  RANSAC  can  be  used  to  simultaneously  determine  the 
best  mapping  and  remove  feature  correspondences  which  do  not  fit  this  mapping. 

2.7  Kalman  Filtering 

A  Kalman  filter  is  a  recursive  algorithm  which  provides  an  estimate  of  the  mean  and 
covariance  of  a  stochastic  differential  equation  based  on  the  previous  estimate, 
deterministic  inputs,  measurements,  and  independent  white  Gaussian  noises.  There  are  a 
number  of  variations  of  the  Kalman  filter  derived  in  literature.  For  this  thesis,  the  Kalman 
filter  will  be  described  as  a  two  step  process  consisting  of  a  linear  propagation  [44]  and  a 
delayed  state  update  [11]. 

2.7.1  Linear  Kalman  Filter  Propagation. 

A  Kalman  filter  is  derived  from  a  stochastic  differential  equation  of  the  form 

x(t)  =  F  (f)x(f)  +  B(t)u(0  +  G(f)w(f)  (2.5) 

where  F(t)  is  the  dynamics  matrix,  x(/)  is  the  state  of  the  equation,  B(t)  is  the  input  matrix, 
u(t)  is  a  vector  of  inputs,  G(t)  is  the  noise  matrix,  and  w(/)  is  a  vector  of  white  Gaussian 
noise  sources  having  the  properties 


E[w(t)]  =  0 

E[w(f)w(Or]  =  Q(f)<S(f-0  (2.6) 
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where  £[•]  is  the  expectation  operator,  6(1  -  t')  is  the  dirac  delta  function,  and  Q(/)  is  a 
matrix  of  noise  strengths  at  time  t.  In  the  discrete-time  implementation  [44],  propagation 
is  defined  by  the  equation 


x(4 )  =  0(4, 4_i)x(^_j)  +  Bj(4)u(4_i)  (2.7) 

where  the  x(Z7)  is  the  propagated  state  estimate  at  time  4  prior  to  a  measurement  update 
(indicated  by  the  superscript  minus)  and  x(^_,)  is  the  updated  state  at  time  4_i  (indicated 
by  the  superscript  plus).  The  state  transition  matrix  0(4, 4_i)  is  defined  as  [44] 


mk,tk-i)  =  emHtk~tk-l) 


and  can  be  approximated  by 


(2.8) 


0(4, 4-0  -  I  +  F(4)  •  (4  -  4_i)  (2.9) 

when  the  time  difference  between  4  and  tk-i  is  relatively  small  compared  to  the  dynamics 
equations.  Similarly,  the  approximation  [44] 


Ba( 4)  -  B(4) '  (4  _  4-i)  (2.10) 

for  the  discrete  input  matrix  B^(4)  can  also  be  made  when  the  time  difference  is  small. 

The  covariance  of  the  state  vector  after  propagation  P(Z7)  is  computed  by  the 
equation  [44] 


P(4)  =  0(4,4-i)P(4+-i)Or(4,4-i)  +  Qa(4)  (2.11) 


where  the  approximation 


Qd(tk)  =  G(4)Q(4)Gr(4)  •  (4  -  4-i)  (2.12) 
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for  the  discrete  noise  matrix  Qd(tk)  can  be  made  when  the  time  difference  is  small. 

2.7.2  Delayed  State  Kalman  Filter . 

Traditionally,  a  Kalman  filter  represents  measurements  as  a  function  of  the 
propagated  state  x(^).  However,  in  some  situations,  it  is  more  appropriate  to  relate  the 
measurement  to  the  propagated  state  at  time  k  and  the  updated  state  at  time  k—  1 .  This 
measurement  equation  takes  the  form  [11] 


zk  =  h(x(4 ),  x(£_j))  +  \k  (2.13) 

where  zk  is  the  realized  measurement,  h(x(/^),  xf/^, ))  is  a  nonlinear  measurement 
function,  and  \k  is  additive  white  Gaussian  noise  with  noise  strength  R.,.  The  update 
equation  is  [11] 


x(^)  =  x(4)  +  K[zk  -  h(x(rj,x(^_,))]  (2.14) 

where  K  is  the  Kalman  gain  matrix  computed  by  the  equations  [11] 

L  =H,P(/-)H[  +  R  + 

+  H,OP(4+_,)H[_,  +  H,_1P(4+_1)®rH[  (2.15) 

K  =P(4)H[  +  OP(C,)H[_,L-1  (2.16) 

and  the  matrices  H/  and  H/(_|  are  the  first  order  linearizations  of  the  function 

h(x((p,  j))  with  respect  to  x(fp  and  xf/^,  ).  The  covariance  update  is  computed  from 

the  equation 


P(4+)  =  P(t~)  -  KLKr 

Equations  2.15,  2.16,  2.17  form  the  basis  for  the  delayed  state  Kalman  filter. 


(2.17) 
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2.8  SLAM 


Fundamentally,  a  SLAM  algorithm  attempts  to  simultaneously  solve  for  the  position 
of  a  vehicle  and  create  a  map  of  landmark  features  in  an  unknown  environment.  The 
seminal  work  in  this  area  was  proposed  by  Smith,  et  al.  [59],  who  showed  that  spatial 
information  could  be  represented  in  a  stochastic  map.  Consider  the  simplified  example  in 
which  the  position  and  orientation  of  a  robot  is  described  by  two  dimensional  position 
coordinates  (x,  y )  and  an  angle  4>  defined  as  a  rotation  about  an  orthogonal  z-axis  such  that 
the  vehicle  state  is 


x 


x  = 


<P 


(2.18) 


In  this  example,  uncertain  spatial  relationships  are  modelled  by  the  mean  and  covariance 
of  the  state  x 


X 

9 

_ 2 

or 

x  =  E(x)  = 

y 

P  =  E  [(x  -  x)(x  -  x)r]  = 

°~y 

< 

(2.19) 


The  distribution  which  assumes  the  least  information,  given  only  the  mean  and 
covariance,  is  the  normal  distribution  [59].  Thus,  the  SLAM  algorithm  can  be  developed 
as  a  Kalman  Filter.  As  a  vehicle  navigates  through  an  environment,  new  features  are 
detected  and  augmented  to  the  state  if  they  pass  some  criterion.  Features  are  usually  given 
a  position  in  the  same  coordinate  frame  as  the  filter  state  such  that  at  any  given  time  the 
state  and  covariance  of  the  filter  are 
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where  Xr  is  the  position  of  the  robot  and  xyi . . .  Xfn  are  the  positions  of  feature.  The  values 
of  the  off-diagonal  elements  in  the  covariance  matrix  depend  on  the  correlations  between 
the  state  of  the  robot  and  features  in  the  map. 

While  simple,  the  EKF-SLAM  solution  to  the  localization  problem  was  proved 
inconsistent  in  [36],  which  showed  that  even  in  a  stationary  case  with  no  process  noise,  the 
filter  becomes  inconsistent.  Inconsistency  occurs  when  the  filter’s  estimated  uncertainty 
does  not  match  the  true  error.  In  extended  Kalman  filter  (EKF)  implementations,  this  is 
due  to  the  sensitivity  of  the  Jacobian  matrix  to  noisy  observations  and  erroneous 
estimates.  Further,  it  was  shown  that  this  inconsistency  cannot  be  circumvented  by  other 
nonlinear  Kalman  filtering  methods,  such  as  the  UKF  or  IEKF  [5],  since  they  ultimately 
perform  some  level  of  linearization.  It  was,  however,  determined  by  Bailey,  et  al.  [5]  that 
inconsistency  is  fundamentally  related  to  the  ’’true”  heading  variance.  If  direct 
observations  of  the  heading  can  be  made,  so  that  heading  uncertainty  is  small,  then  the 
EKF-SLAM  algorithm  can  at  least  avoid  catastrophic  failure,  but  not  inconsistency  [5]. 
The  following  subsections  illustrate  techniques  which  seek  to  further  reduce  linearization 
errors  and  computational  complexities  in  order  to  enable  full  scale  EKF  SLAM. 

2.8.1  Submapping  Techniques. 

Local  sub-map  techniques  are  a  subset  of  the  larger  SLAM  literature  initially 
developed  to  reduce  the  computational  cost  of  implementing  full  scale  SLAM  algorithms 
in  real  time.  An  added  benefit  is  that  using  submaps  can  also  mitigate  global 
inconsistency.  In  local  submapping  algorithms  such  as  [50], [51], [13],  and  [22], 
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Figure  2.7:  An  Illustration  of  Supmapping.  A  Bayesian  representation  of  submapping 
techniques  shows  that  submaps  are  independent,  given  node  Xc  [51]. 


conditionally  independent  local  submaps  of  bounded  size  or  error  are  built  and  maintained 
in  memory.  The  global  map  can  then  be  reconstructed  by  piecing  together  these  local 
maps.  An  illustration  of  the  submapping  problem  is  shown  in  Figure  2.7. 

These  maps,  however,  are  populated  by  clusters  of  feature  points  such  as  corners, 
edges,  and  contours  which  consume  large  amounts  of  space  and  may  not  provide  a 
meaningful  reconstruction  of  the  environment  to  the  user.  Furthermore,  in  order  to  close  a 
loop,  a  re-observed  feature  must  be  matched  to  a  feature  stored  in  a  submap.  Considerable 
computation  time  can  be  consumed  in  order  to  find  an  observed  feature  buried  in  a 
submap. 
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2.8.2  Enforcing  the  Observability  Conditions. 

The  fundamental  flaw  in  EKF-SLAM  is  that  the  linearized  system  has  an  extra 
observable  direction  which  is  not  observable  to  the  true  system  [29].  This  leads  to 
inconsistency.  To  counter  this  problem,  Hesch,  et  al.  [29]  propose  a  system  which 
maintains  the  nullspace  N  at  each  timestep  /  by  enforcing  the  conditions 

N/+1  =  0,Nz  ,  H/N/  =  0,  l=l,...,k  (2.21) 

through  modifications  to  the  state  transition  matrix,®/,  and  the  measurement  Jacobian,  H/. 
Enforcing  these  conditions  prevents  information  gain  in  an  unobservable  direction  and 
greatly  improves  consistency,  at  the  cost  of  modifying  ®/  and  H  after  linearization. 

2.8.3  Inverse-Depth  SLAM. 

A  disadvantage  in  monocular  SLAM  systems  is  the  lack  of  depth  information, 
particularly  for  features  which  exhibit  low  parallax  -  meaning  that  there  is  insufficient 
change  in  the  features’  positions  when  viewed  in  multiple  frames.  As  opposed  to  classical 
monocular  SLAM  techniques,  inverse-depth  SLAM  parametrizes  each  feature  on 
initialization  with  six  states  [16] 


y.- 
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Xi  yt  Zi  6i  fi  Pi 


(2.22) 


where  x„  y„  z,i  are  the  coordinates  of  the  camera’s  optical  center;  0n  0,-  are  azimuth  and 
elevation;  and  p,  is  the  inverse  depth.  Figure  2.8  illustrates  how  features  are  parametrized 
with  inverse  depth.  Encoding  features  this  way  allows  distant  points  to  be  tracked  over 
time  and  to  immediately  contribute  to  the  SLAM  solution.  If  sufficient  parallax  occurs,  the 
feature’s  depth  uncertainty  is  reduced.  While  features  at  low  parallax  will  take  up  six 
states,  features  at  high  parallax  can  be  converted  to  the  standard  X,  Y,  Z  representation  to 
reduce  computational  burden.  This  parametrization,  however,  cannot  eliminate 
linearization  errors  caused  by  the  filter’s  dynamics  or  measurement  equations  [16]. 
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Figure  2.8:  Inverse  Depth  Parametrization.  An  illustration  of  the  inverse  depth 
parametrization  [16]. 


2.9  Pose  SLAM 

Another  pillar  in  SLAM  literature  results  from  simply  restructuring  the  vehicle’s  state 
[32]  [38]  [32]  [41].  In  the  pose  SLAM  architecture,  the  state  vector  consists  of  a  finite  or 
continuous  history  of  past  camera  poses.  Features  identified  between  camera  poses  are 
used  to  establish  constraints  between  views.  Feature  information  (or  raw  image  data)  is 
carried  forward  in  time  to  establish  relationships  between  all  poses  in  a  vehicle’s 
trajectory,  but  the  information  is  not  typically  included  in  the  system  state  vector.  Loop 
closures  can  then  be  treated  more  intuitively:  a  loop  is  closed  when  the  trajectory 
intersects  itself. 

A  key  advantage  in  pose  SLAM  is  the  sparse  nature  of  the  information  matrix  which 
relates  poses  in  the  trajectory.  The  information  matrix  scales  with  the  number  of  poses  in 
the  state  vector  and  has  non-null  entries  only  for  poses  that  are  related  [32].  This  means 
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that  the  information  matrix  is  mostly  tridiagonal  (since  consecutive  poses  are  always 
related)  with  a  small  number  of  off-diagonal  blocks  corresponding  to  loop  closures  [32]. 

The  formulation  of  the  pose  SLAM  problem  lends  itself  to  non-linear  optimization 
techniques  such  as  bundle  adjustment  [21]  [33]  [65],  which  attempts  to  minimize 
reprojection  error  of  features  tracked  over  a  finite  or  complete  set  of  images  with 
approximated  camera  poses.  Bundle  adjustment  will  be  explained  in  greater  detail  in 
Section  2.10.4.  Full  bundle  adjustment  -  using  all  camera  poses  and  corresponding 
images  -  is  typically  not  feasible  as  the  primary  estimator  for  an  online  navigation  system, 
especially  one  with  limited  computational  power.  The  following  subsections  outline 
methods  for  reducing  complexity  when  using  bundle  adjustment  to  solve  for  an  optimal 
trajectory. 

2.9.1  Reduced  Pose  SLAM. 

In  general,  not  all  poses  or  loop  closures  yield  high  information  gain.  This  motivates 
the  practice  of  ignoring  redundant  information  so  that  only  valuable  information  is 
retained  and  the  sparsity  of  the  information  matrix  is  ensured.  Konolige  and  Agrawal  [38] 
proposed  a  reduced  Pose  SLAM  system  for  indoor  navigation  based  on  skeleton  frames 
taken  at  regular  5m  intervals,  which  reduces  the  number  of  data  points  on  a  vehicle’s 
trajectory. 

Instead  of  including  frame  information  at  arbitrary  intervals,  Ila,  et  al.  [32]  showed 
that  the  information  gain  and  distance  between  poses  could  be  computed  in  closed  form 
prior  to  deciding  whether  to  include  a  pose  in  the  trajectory.  The  distance  d  between  two 
poses,  Xi  and  x„ ,  can  be  estimated  as  Gaussian  with  mean  d  and  covariance  P d 


d  =  h(xi,  xn) 


P/  = 


(2.23) 
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where  /?(■,  •)  is  the  measurement  equation  and  H  is  the  measurement  Jacobian,  partitioned 
into  the  x-,  portion  (H,)  and  the  x„  portion  (H„).  Ila,  et  al.  choose  to  marginalize  this 
multidimensional  Gaussian  for  each  of  its  dimensions,  k,  and  determine  the  probability 
that  the  distance  is  within  some  interval  \-k,  k] 


P(-K<dk<K)=  N(dk,<ri)  (2.24) 

J-K 

If  the  probability  is  less  than  a  defined  threshold  for  all  dimensions,  the  pose  is  included  in 
the  map  [32]. 

For  loop  closures,  the  information  gained  is  computed  from  the  logarithm  of  the  ratio 
of  determinants  of  prior  and  posterior  covariances  [32],  which  simplifies  to 


S  =  R  + 


H,  H, 


P  P 

x  u  K  in 


pr  p 


H,  H„ 


,  1,  ISI 

A  =  -In— 

2  R 


(2.25) 


where  R  is  the  measurement  covariance.  If  A  is  above  a  certain  threshold,  there  is  sufficient 
information  gain  from  the  candidate  loop  closure  and  it  should  be  included  in  the  map. 

2.9.2  Adaptive  Windowing. 

Another  method  for  reducing  complexity  in  the  Pose  SLAM  framework  is  to  limit  the 
number  of  poses  in  the  state.  Rather  than  using  a  fixed  window  of  poses,  which  risks 
eliminating  valuable  loop  closure  information,  an  adaptive  windowing  technique  was 
introduced  by  Lim,  et  al.  [41]  that  adapts  the  number  of  poses  in  the  state  vector  to 
minimize  information  loss  during  loop  closures  and  minimize  complexity  in  open  loop. 
Unfortunately,  for  long  trajectories,  it  is  impossible  to  avoid  eliminating  poses  which  close 
loops  unless  the  window  continues  to  grow. 
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2.10  Image  Geometry 

Image  geometry  draws  insights  from  projective  geometry.  In  the  case  of  a  single 
view,  it  will  be  shown  that  given  coordinate  information  about  an  object  in  a  scene,  the 
extrinsic  parameters  of  the  camera  viewing  the  object  (with  respect  to  the  object’s 
coordinate  frame)  can  be  extracted.  In  the  case  of  multiple  views,  it  will  be  shown  that  the 
relative  geometry  between  frames  can  be  exploited  to  determine  the  motion  of  the  sensor, 
or  sensors,  which  took  the  images. 

2.10.1  Single  View  Geometry  . 

In  Dean’s  work  [18]  the  calculation  of  vanishing  points  was  supported  by  the  known 
structure  of  indoor  environments.  This  information  was  built  into  the  system  -  i.e.  the 
system  had  no  way  of  recognizing  it  operated  in  a  specific  environment.  In  a  similar  way, 
the  perspective  n-point  (PnP)  problem  relates  known  environment  information  to  image 
data.  Let  u  and  v  be  the  pixel  locations  for  a  point  [xy  z\T  in  3-space.  Pixel  coordinates 
are  related  to  a  three  dimensional  point  by  the  equations  [58] 


fx(rux  +  rny  +  rnz  +  tx)  +  y(r2  xx  +  r22y  +  r23z  +  ty) 


D  ix  +  r32y  +  r33z  +  tz 


+  Cr 


V  =  fy 


r2\x  +  r22y  +  r23z  +  ty 


-J-  Q 

'  r3lx  +  r32y  +  r33z  +  tz 


(2.26) 


where  each  ri;  is  an  element  of  the  rotational  matrix  R  and  [tx  ty  tz]T  is  the  translation  from 
the  frame  of  the  imaged  points  to  the  camera  frame.  The  parameters  cx,  cy,  fx,  fy,  and  y 
are  known  from  camera  calibration. 

In  the  PnP  problem,  both  the  3-D  points  and  the  pixel  locations  are  known.  Since 
each  pair  («,  v)  contributes  two  equations  for  12  unknowns,  6  known  points  are  needed  to 
find  a  solution.  By  realizing  that  the  rotational  matrix  can  be  decomposed  into  three 
angular  parameters,  the  number  of  unknowns  becomes  6  and  only  3  known  points  are 
needed,  as  shown  in  [26].  However,  factoring  in  uncertainty  and  nonlinearities,  some  PnP 
algorithms  involve  at  least  4  known  points  [39]  [27]  [30]. 
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Figure  2.9:  An  Illustration  of  the  Epipolar  Constraint.  Two  camera  centers  viewing  the 
same  3D  point  lie  on  an  epipolar  plane.  Under  pure  translation,  the  point  in  the  image  will 
move  along  the  epipolar  line  [62]. 


2.10.2  Two  Views . 

Epipolar  geometry  describes  the  relationship  between  two  views  of  a  scene.  The 
study  of  epipolar  geometry  is  typically  motivated  by  finding  corresponding  points  in 
stereo  images  [28].  Let  the  coordinates  of  a  point  pM  in  a  world  coordinate  frame  be 
imaged  in  two  views  as  the  points  x  and  x'.  The  points  x,  x'  are  the  projections  of  the 
pointing  vectors  pA,  Pb,  respectively,  onto  the  images  A  and  B.  Epipolar  geometry  states 
that  these  pointing  vectors,  the  point  pv,  and  the  camera  centers  t A,  tg  are  all  coplanar,  as 
shown  in  Figure  2.9. 

For  two  calibrated  cameras,  the  normalized  points  x'  and  x  are  related  up  to  scale 
through  the  essential  matrix 


x,tEx  =  0 

E  =  [t]xR  (2.27) 
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The  decomposition  of  E  into  [t]xR  means  that  rotation  and  unsealed  translation  can  be 
extracted  from  the  essential  matrix  relating  two  views  of  a  scene. 

Much  like  the  equations  for  a  single  view,  the  elements  of  the  essential  matrix  can  be 
formed  into  a  vector  and  expressed  by  the  system  of  linear  equations 


*i*i  x'y,  x[  y\Xi  y\yi  y\  xx  yY  1 


xnxn 


Kyn  K  ynxn  ynyn  yn  xn  yn 


e  =  0 


(2.28) 


As  outlined  in  [28],  this  system  can  be  solved  by  RANSAC  with  only  7  point 
correspondences.  This  solution  is  then  further  optimized  to  yield  a  final  estimate  of  the 
essential  matrix. 

After  the  essential  matrix  is  recovered,  the  camera  matrices  representing  the  rotation 
matrix  R  and  unsealed  translation  vector  t  can  be  recovered  by  singular  value 
decomposition  [28].  First  we  define  the  normalized  camera  matrices  as 


Letting  E  have  SVD 


P  =  [I  I  0]  ,  P'  =  [R  1 1] 


(2.29) 


E  =  U£Vr 

5  0  0 


£  = 
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5  0 


0  0  0 


(2.30) 


Hartley  and  Zisserman  show  that  one  of  the  solutions  for  P',  corresponding  to  the 
reconstructed  points  being  in  front  of  both  cameras  is 
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P'  =  [UWTVT  |  u3] 
0-10 


W=  1 


0  0 


0  0  1 


(2.31) 


where  u3  is  the  third  column  of  U.  Note  that  a  property  of  the  essential  matrix  is  that  it  is  a 
singular  matrix  and  its  two  non-zero  singular  values  are  identical. 

Epipolar  geometry  was  used  in  [20]  to  constrain  the  motion  between  two  consecutive 
images  to  the  epipolar  plane.  In  other  words,  by  definition  of  the  epipolar  plane,  the 
baseline  between  two  cameras  viewing  the  same  point  must  be  contained  in  the  epipolar 
plane,  or  mathematically  [20] 


Ax  =  tb  -  tfl 

Ax  •  (pa  xpj)  =  0  (2.32) 

Diel,  et  al.  [20]  propose  a  residual  based  on  Equation  2.32  that  is  a  projection  of 
out-of-plane  translation  onto  the  epipolar  plane.  The  measurement  equation  for  the 
residual  is 


_  P a  X  P/,  _  AX 

6  llPflXpill  e*  ||Ax|| 

x  =  (I  -  exeJ)eze^Ax  (2.33) 

which  vanishes  when  the  epipolar  constraint  is  satisfied.  In  this  framework,  each 
individual  feature  matched  between  two  frames  incrementally  rotates  the  camera  baseline 
to  a  position  which  minimizes  the  residual  error. 
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2.10.3  Three  Views. 


Three- view  geometry  is  a  natural  extension  of  two  views.  A  three-view  constraint 
involves  finding  two-view  constraints  between  three  images  that  are  consistent  with  each 
other.  Three-view  constraints  were  used  by  Indelman,  et  al.  [35]  in  a  vision-aided 
navigation  system,  illustrated  in  Figure  2.10.  Using  the  notation  in  Figure  2.10,  the 
translations  T,;  must  satisfy  the  constraints  [35] 


Figure  2.10:  Three  Views  of  a  Common  Point.  Three  views  of  the  same  feature  can  be  used 
to  establish  a  three-view  constraint  on  the  three  cameras  which  viewed  the  feature  [35]. 


q[(Ti2  x  q2)  -  0 
q2  (T23  x  q3)  =  0 

(q2  x  qi)r(q3  x  T23)  =  (q,  x  T12)r(q3  x  q2)  (2.34) 

which  are  simply  two  epipolar  constraints  and  a  third  constraint  that  forces  consistency 
between  the  three  translations. 

The  algorithm  begins  by  extracting  feature  matches  from  three  overlapping  images 
using  SIFT.  Matching  is  first  performed  between  the  images  1  and  2,  and  2  and  3.  In  order 
to  robustify  these  matches,  the  fundamental  matrix  is  computed  in  a  RANSAC  approach 
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for  both  image  pairs.  Matching  is  then  performed  between  those  two  sets  to  determine 
triplet  matches  .  The  set  of  triplet  matches  are  then  used  in  an  IEKF  to  aid  an  IMU. 

2.10.4  n  Views  and  Bundle  Adjustment . 

Bundle  adjustment,  as  its  name  would  suggest,  is  the  process  of  minimizing  the 
re-projection  error  over  a  set  of  image  points  in  n  views  by  adjusting  the  three  dimensional 
coordinates  of  each  imaged  point,  the  camera  calibration  parameters,  and  the  associated 
camera  pose  (which  form  a  bundle  of  rays).  An  illustration  of  the  bundle  adjustment 
problem  for  a  single  frame  is  shown  in  Figure  2.11. 


Figure  2.11:  An  Illustration  of  the  Bundle  Adjustment  Problem.  The  goal  of  bundle 
adjustment  is  to  iteratively  refine  the  camera  position  and  3D  feature  locations  so  that  the 
rays  (shown  as  arrows)  pass  through  the  imaged  points  of  each  3D  feature. 


The  goal  is  to  find  a  vector  x,  consisting  of  all  camera  poses  P  and  calibration 
parameters  K,  and  3D  feature  locations  X,  which  maximizes  a  log-likelihood  function.  To 
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begin,  let  Z  =  {zj ,  z2, . . . ,  z„]  be  independent  measurements  assumed  to  be  from 
independent  Gaussian  pdf’s  with  means  0,  and  identical  covariances  £.  The 
measurements  are  related  to  each  mean  by  the  function  0,  =  h(K,  P,  X).  The  likelihood 
function  L(0|Z)  to  maximize,  where  0  =  {0i,  02, . . . ,  0,,},  is  [12] 


L(0|Z)  cx:  Y\  exp((-l/2)[0f  -  z/]7’E_1[0/  -  z,-]) 

i=  1 


exp 


(-1/2)  ^ [0f  -  zi]rL"1[0i  -  zf] 


i=i 


\ 


/ 


log(L(0)  cv  (-1/2)  ^[0,  -  ZifL"1!©,-  -  z ,-] 

i=i 


=  (-1/2)  ^  ||0,-  -  ZiWh  (2.35) 

i=i 

where  ||0,  -  z,]^  is  the  Mahalanobis  distance  [28].  Maximizing  the  log-likelihood 
function  is  equivalent  to  finding  a  0  such  that 


0  =  argnhn(- 1/2)^||0(- -ZiU^  (2.36) 

i=i 

There  are  many  techniques  for  solving  this  system,  the  most  basic  of  which  being 
weighted  least  squares  [65].  It’s  important  to  note  the  large  size  of  the  state  vector  being 
optimized  in  the  vision  navigation  case.  This  motivates  techniques  for  efficiently  solving 
the  least  squares  problem  which  take  advantage  of  the  sparse  nature  of  the  Jacobian  matrix 
[21]  [33]  [65]  [37], 

2.11  Stereo  Vision  SLAM 

Another  popular  approach  in  SLAM  literature  involves  measuring  the 
three-dimensional  coordinates  of  feature  points  with  a  stereo  vision  system  (SVS)  [23] 

[17]  [57]  [15].  In  a  SVS,  two  cameras  are  rigidly  mounted  to  a  vehicle  and  their  relative 
position  -  the  camera  baseline  -  and  orientation  is  calibrated.  Based  on  epipolar  geometry, 
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as  described  in  Section  2.10.2,  the  image  coordinates  of  each  feature  identified  as 
common  to  both  cameras  in  the  SVS  can  be  used  with  the  calibrated  camera  parameters  to 
determine  the  three-dimensional  location  of  a  feature.  In  a  SLAM  network,  this 
three-dimensional  feature  measurement  can  be  used  to  simultaneously  update  the 
coordinates  of  its  corresponding  feature  in  the  map  and  the  location  of  the  vehicle. 

The  accuracy  of  the  SVS  is  determined  by  the  error  in  the  calibrated  internal  and 
external  parameters.  In  [67]  it  was  shown  that  errors  in  calibration  for  the  external 
parameters,  the  relative  translation  and  rotation  between  the  stereo  camera  pair,  directly 
affect  reconstruction  accuracy  and  are  the  dominating  factors  for  determining  the  location 
of  an  object  in  three-dimensions.  Further,  it  can  be  shown  that  reconstruction  accuracy 
increases  as  the  camera  baseline  extends.  Thus,  when  using  a  SVS  on  a  single  vehicle  the 
maximum  reconstruction  accuracy  is  limited  by  the  dimensions  of  the  vehicle  which 
determines  where  the  cameras  can  be  mounted. 

2.12  Constraint  Based  Localization 

As  an  alternative  to  SLAM,  feature  correspondences  over  two  or  more  images  can  be 
used  to  set  up  constraints  on  camera  motion.  The  features  are  not  included  in  the  state  and 
therefore  the  solution  involves  less  computation. 

The  multi-state  constraint  Kalman  filter  developed  by  Mourikis  and  Roumeliotis  [48] 
uses  feature  correspondences  across  multiple  images  to  set  up  constraints  on  camera 
motion  and  IMU  estimates  in  an  EKF.  Each  time  a  new  image  is  captured,  the  state  vector 
is  augmented  with  states  representing  the  estimated  camera  position  and  attitude  such  that 
at  any  time  the  state  vector  contains  a  series  of  N  camera  poses  and  the  state  of  the  IMU. 
Each  feature  measurement  imposes  a  constraint  on  the  camera  pose  and  feature  location. 
Mourikis  and  Roumeliotis  form  the  residual  vector  [48] 
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r  =  z  -  z 


(2.37) 


r  =  HxX  +  HyPy  +  n  (2.38) 

where  HY  and  H y  are  the  Jacobians  of  the  measurement  z  with  respect  to  the  state  and 
feature  positions  respectively,  X  is  the  state,  py  is  the  error  in  the  position  estimate  of  the 
feature,  and  n  is  white  Gaussian  noise.  By  projecting  the  residual  onto  the  left  nullspace 
of  Hy,  the  residual  becomes  independent  of  the  errors  py  and  EKF  updates  can  be 
performed  without  including  the  feature  in  the  filter  state  [48]. 

2.13  Hybrid  Methods 

The  last  few  sections  outlined  two  of  the  broad  techniques  for  localizing  a  vehicle: 
SLAM  and  nonlinear  optimization  using  multi- view  constraints.  This  section  outlines 
works  which  synthesize  these  methods.  These  hybrid  methods  take  advantage  of  the 
computational  performance  of  SLAM  and  the  high  accuracy  of  nonlinear  optimization. 
While  Pose  SLAM  traditionally  allows  for  bundle  adjustment  following  completion  of  a 
trajectory,  these  methods  implement  bundle  adjustment  as  a  background  process  which  is 
able  to  integrate  its  solution  into  the  active  filtering  process. 

Perhaps  the  pioneer  work  in  this  type  of  method  is  the  dual-layer  estimator  proposed 
by  Mourikis  and  Roumeliotis  [49].  This  work  implements  their  multi-state  constraint 
Kalman  filter  as  the  primary  estimator  and  uses  a  batch  estimator  to  close  loops.  Candidate 
loop  closures  are  chosen  by  evaluating  a  distance  criterion  between  camera  poses.  The 
candidates  are  then  processed  with  image  and  inertial  information  from  memory  by  a 
bundle  adjustment  algorithm  running  in  parallel.  Reference  [49]  does  not,  however, 
succinctly  explain  how  the  bundle  adjustment  solution  is  fed  back  to  the  active  filter. 

Kaess,  et  al.  [37]  showed  that  representing  the  information  matrix  of  the  pose  SLAM 
problem  in  a  graphical  way  more  easily  shows  that  smoothing  and  mapping  can  be 
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implemented  as  separate  processes  as  shown  in  Figure  2.12.  Given  the  separator,  the 
operations  of  smoothing  and  mapping  are  clearly  independent  processes. 


Figure  2.12:  Separation  of  the  Smoothing  and  Mapping  Processes.  The  bayes  tree 
representation  of  the  information  matrix  shows  that  smoothing  and  mapping  can  be 
performed  separately  [37] 


The  algorithm,  named  incremental  smoothing  and  mapping  (iSAM),  also  has  the 
advantage  of  incrementally  solving  the  nonlinear  system.  All  past  poses  marginalized  into 
the  smoother  are  factored  into  ’’cliques”  so  that  adding  a  new  pose  only  affects  a  smaller 
subset  of  poses  related  to  the  new  pose.  This  attribute  helps  alleviate  data  latency  and 
improves  the  overall  estimation  framework. 

2.14  Extension  to  Multiple  Vehicles 

Localization  performed  by  multiple  vehicles  has  been  shown  to  have  a  substantial 
improvement  over  single  vehicle  localization  [47]  [53].  Even  a  vehicle  with  highly 
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accurate  sensors  can  benefit  from  measurements  made  by  another  vehicle  with  less 
accuracy. 

One  framework  for  multiple  vehicle  localization  is  to  navigate  in  an  alternating 
fashion  where  one  section  of  the  vehicles  fix  their  position  and  act  as  landmarks  for  the 
rest  of  the  group  to  navigate  [61]  [19].  For  vision  navigation,  viewing  vehicles  with 
known  geometry  can  remove  some  scale  ambiguity,  resulting  in  a  more  accurate 
navigation  solution.  This  ’’leap-frog”  approach  is  suitable  for  ground  vehicles  which  can 
easily  act  as  stationary  landmarks,  but  is  far  less  applicable  to  aerial  vehicles  which  are 
difficult  to  hold  in  place  without  landing. 

A  more  appropriate  framework  for  a  network  of  aerial  vehicles  is  for  all  vehicles  to 
move  continuously  while  identifying  matching  points  in  their  views.  In  [45]  [34],  passive 
measurements  of  another  vehicle  are  extracted  from  three  overlapping  views  of  the  same 
scene.  By  identifying  features  common  to  each  image,  these  systems  simultaneously  solve 
for  the  optimal  camera  orientations  and  feature  locations  in  all  three  images.  This 
requires,  however,  that  each  new  image  be  compared  to  all  images  in  a  repository  in  real 
time  to  determine  if  two  images  of  the  same  scene  exist. 

2.15  Autonomous  Navigation 

Up  to  this  point,  only  the  topic  of  localization  and  mapping  has  been  discussed.  The 
logical  next  step  from  a  reliable  navigation  system  is  to  incorporate  autonomous  command 
and  control,  as  well  as  autonomous  coordination,  synchronization,  and  consensus  in  the 
case  of  multiple  vehicles. 

At  a  fundamental  level,  vehicle  autonomy  must  achieve  active  decision  making  with 
respect  to  physical  cues.  For  example,  identification  of  the  end  of  a  hallway  might  be  used 
in  planning  a  path  for  exploration.  In  this  sense,  decision  making  is  often  reactive  which 
implies  that  autonomy  can  be  achieved  by  commanding  or  training  a  response  to  specific 
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inputs.  Autonomy  is  often  highly  dependent  on  navigation.  Autonomous  decisions  are 
frequently  tied  to  a  navigational  cue,  such  as  position,  velocity,  or  orientation. 

As  an  example,  consider  the  autonomous  navigation  of  [3].  In  this  work,  Adler  and 
Xaio  propose  an  unmanned  system  capable  of  mapping  unknown  urban  environments 
through  autonomous  path  planning  and  multiple  sensor  fusion.  While  the  mapping 
algorithm  used  in  this  system  is  made  simple  through  the  fusion  of  a  GNSS  antenna  and 
IMU,  the  point  remains  that  the  algorithm  must  have  some  sense  of  its  location.  Without 
such  information,  the  system  would  be  unable  to  determine  which  areas  had  already  been 
explored. 

2.16  Previous  Work 

In  the  previous  work  [18],  Dean  developed  a  real-time  navigation  system  to  enhance 
the  position  and  orientation  accuracies  of  an  ARDrone  quad-rotor  helicopter.  Through 
experimentation,  it  was  discovered  that  the  heading  reported  by  the  ARDrone  tended  to 
randomly  drift  during  runtime.  To  correct  this  drift,  the  navigation  system  tracked 
perspective  features  called  vanishing  points,  which  are  the  theoretical  intersection  of 
parallel  lines  at  infinity.  Because  these  vanishing  points  exhibit  little  to  no  parallax  as  the 
vehicle  moves,  they  serve  as  absolute  references  for  determining  heading.  An  EKF  was 
then  used  to  fuse  the  measurements  reported  by  the  ARDrone  with  the  vanishing  point 
computed  from  processing  a  frame  of  the  video  stream. 

2.16.1  System  Design  and  Dynamics. 

The  Kalman  filtering  system  of  Capt  Dean’s  design  was  implemented  in  off-board 
software  which  was  integrated  into  a  navigation  graphical  user  interface  (GUI).  The  GUI 
was  provided  in  a  software  development  kit  (SDK)  released  by  Parrot  SA  [1].  The  GUI 
provides  live  video,  displays  telemetry  data,  and  enables  joystick  control.  The  developed 
navigation  software  intercepts  a  video  frame  after  it  is  decoded  by  one  of  the  SDK 
threads.  Once  decoded  and  if  the  vanishing  point  algorithm  is  ready,  the  image  is 
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processed  by  the  algorithm  to  predict  the  vanishing  point.  The  predicted  vanishing  point  is 
passed  to  a  command  thread  which  computes  a  control  input  to  correct  heading  errors.  A 
block  diagram  of  the  system  is  presented  in  Figure  2.13. 


Figure  2.13:  System  Design  in  the  Previous  Work.  The  diagram  of  the  system  design  in 
[18]  shows  the  relationship  between  the  developed  software  and  the  provided  SDK. 


Due  to  limited  access  to  the  ARDrone’s  hardware  and  visual  odometry,  the  stochastic 
system  model  used  in  [18]  had  to  take  a  simplified  form.  Due  to  the  frequency  of 
measurements,  the  input  to  the  system  was  neglected  so  that  the  simplified  stochastic 
model  of  the  system  took  the  form 

x(0  =  F(f)x(f)  +  G(t)w(0  (2.39) 

so  that  the  system  dynamics  were  assumed  linear  between  measurements.  The  state  vector 
x(t)  was  composed  as 
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(2.40) 


x=  XN  Yn  5i/j  6i]/  V?  V. * 

where  XN,  YN  were  the  horizontal  position  of  the  drone  in  the  navigation  (hallway)  frame, 
Stl/,  6ij/  were  the  heading  error  and  heading  error  rate  of  the  drone-reported  heading,  and 
Vf ,  Vy  were  the  velocities  of  the  drone  in  the  navigation  frame. 

One  of  the  core  algorithms  running  in  the  software  is  the  vanishing  point  estimator. 
The  vanishing  point  estimation  algorithm  begins  when  a  new  frame  becomes  available. 
The  frame  is  processed  to  remove  distortion  and  extract  lines.  If  a  predicted  vanishing 
point  exists,  only  lines  within  five  degrees  of  the  predicted  vanishing  point  are  considered. 
Otherwise,  all  lines  are  considered  candidates  [18]. 

The  vanishing  point  algorithm  uses  a  RANSAC  algorithm  to  estimate  the  vanishing 
point  from  candidate  lines.  This  point  is  converted  to  a  unit  vector  and  passed  to  the  state 
estimator,  which  uses  the  vector  and  velocities  reported  by  the  drone  to  compute  a 
measurement  update.  The  heading  correction  from  the  measurement  update  is  converted 
into  a  control  input  and  encoded  for  execution  by  the  drone. 

2.16.2  Current  System. 

The  system’s  primary  weak-point  is  in  position  estimation.  Prior  to  testing,  strips  of 
tape  needed  to  be  added  to  the  test  course  so  that  the  drone’s  velocity  estimation  algorithm 
could  function.  Dean’s  system  uses  the  vanishing  point  algorithm  to  rotate  velocity 
measurements  reported  by  the  drone  into  the  navigation  frame.  Velocities  are  then 
integrated  into  position  estimates  through  an  EKF  update.  Figure  2.14  shows  the  results  of 
Capt  Dean’s  software. 


The  effect  of  of  the  velocity  measurements  is  evident  from  the  estimated  ground 
track.  The  velocity  measurements  are  highly  dependent  on  environment  texture  and  are 
often  inaccurate,  which  ultimately  leads  to  error  in  the  estimated  position.  Furthermore, 
the  fact  that  texture  needed  to  be  added  to  the  environment  for  the  drone’s  velocity 
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(a)  Reported  Heading  (b)  Reconstructed  Ground  Track 


(c)  Estimated  Heading  (d)  Estimated  Ground  Track 

Figure  2.14:  Results  of  the  Previous  Work.  The  results  of  [18]  show  a  great  improvement 
from  the  heading  reported  by  the  drone  (a)  and  the  heading  estimated  by  the  filter  (b). 
However,  the  inaccuracies  in  the  velocity  measurements  causes  an  error  in  the  vehicle 
position. 


estimation  algorithm  to  function  reduces  the  system’s  capability  for  autonomous 
navigation.  Much  like  Capt  Dean’s  work,  the  system  described  in  the  following  chapters 
seeks  a  more  accurate  navigation  solution  through  image  processing  techniques,  but 
without  having  to  alter  the  navigated  environment.  The  system  also  gains  the  advantage  of 
additional  information  from  a  second  drone  operating  in  its  field  of  view. 
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2.17  Summary 

This  chapter  has  provided  the  context  for  the  rest  of  this  thesis  by  highlighting  major 
areas  in  the  field  of  vision-aided  navigation.  The  next  chapter  will  draw  upon  the 
foundations  of  this  chapter  to  improve  the  navigation  system  of  [18]  and  extend  its 
operation  to  a  cooperative  network  of  vehicles. 
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III.  Methodology 


The  goal  of  this  research  effort  is  to  develop  a  system  which  is  able  to  process  data 
from  multiple  vehicles  to  accurately  determine  the  full  three-dimensional  position 
of  all  vehicles  in  the  network.  This  chapter  will  discuss  the  architecture  and  algorithms 
used  in  the  multiple  vehicle  system. 

3.1  System  Architecture 

The  system  designed  in  this  chapter  consists  of  a  single  laptop  running  centralized 
navigation  software  for  two  independently  operating  ARDrone  quad-rotor  helicopters. 
Both  ARDrones  are  able  to  communicate  with  the  ground  station  laptop  through  a 
wireless  router  which  is  connected  to  the  laptop  through  its  Ethernet  port.  The  complete 
system  setup  is  shown  in  Figure  3.1. 

Some  of  the  software  developed  takes  advantage  of  an  open-source  meta-operating 
system  called  ROS  [52].  ROS  provides  many  advantages  over  other  robotics  software, 
such  as  hardware  abstraction,  low-level  device  control,  and  package  management,  and  it  is 
designed  to  integrate  seamlessly  with  different  hardware,  programming  languages,  and  the 
like  [52], 

The  main  building  blocks  in  ROS  are  nodes  -  executable  packages  of  code  which 
perform  some  process  or  processes.  Nodes  typically  communicate  with  each  other  by 
passing  messages  through  topics.  A  node  can  either  send  messages  through  a  topic 
(publish)  or  receive  messages  from  a  topic  (subscribe).  In  this  way  nodes  are  pieced 
together  into  a  single  executable  program  through  their  common  topics.  A  more  detailed 
description  of  ROS  can  be  found  at  [2]. 
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Remote  Processing  Computer 


Figure  3.1:  System  Setup.  The  system  setup  consists  of  two  independent  vehicles 
communicating  with  a  central  processing  computer  through  a  WiFi  connection.  The  truth 
server  (optional)  is  also  connected  through  the  router. 


3.1.1  The  ARDrone  Platform. 

The  ARDrone  by  Parrot  SA  is  a  small  commercial  grade  quadrotor  vehicle.  The 
vehicle  has  a  forward  looking  high-definition  camera,  a  downward  looking  camera,  an 
ultrasonic  altimeter,  a  barometric  altimeter,  a  magnetometer,  and  a  MEMs  grade  IMU. 
Inertial,  altimeter,  and  image  information  is  fused  in  the  vehicle’s  on-board 
microprocessor  to  produce  an  estimate  of  attitude  and  altitude.  The  downward  looking 
camera  is  used  to  measure  the  vehicle’s  velocity  using  optical  flow.  The  telemetry  data  is 
combined  with  the  compressed  video  streams  and  status  messages  and  broadcast  over  a 
wireless  link  [1], 

An  SDK  for  a  desktop  computer  is  also  provided  by  Parrot  SA  to  give  developers 
access  to  video  and  telemetry  data  [1].  In  addition,  the  manufacturer  does  not  support 
compiling  software  on  the  ARDrone’s  on-board  processor.  Therefore,  all  algorithms 
developed  in  this  research  were  implemented  on  a  laptop. 
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3.1.2  Hardware  and  Software. 

The  computer  used  to  develop  the  navigation  software  is  a  HP  EliteBook  8560w 
laptop.  The  computer  has  an  Intel  Core  i7  quad-core  processor  running  at  2.2  GHz,  16 
gigabytes  of  RAM,  and  a  256  gigabyte  solid  state  disk  drive.  The  operating  system  used  is 
Ubuntu  Linux  12.04,  running  version  3.20-35  of  the  Linux  kernel.  The  navigation 
software  uses  the  computer  vision  library  OpenCV  version  2.4.2,  the  ROS  Groovy 
distribution,  and  MATLAB  version  R2012a. 

3. 1.2.1  ROS  Packages. 

A  number  of  packages  used  in  this  system  are  available  on-line.  These  packages 
include:  the  ROS  ARDrone  driver,  developed  in  the  Autonomy  Lab  of  Simon  Eraser 
University  [46];  the  Vicon  bridge  package;  the  joystick  driver  package;  and  the  ROS 
camera  calibration  package. 

The  ROS  ARDrone  driver  wraps  around  the  official  ARDrone  SDK  version  2.0  [1]  to 
enable  communication  with  the  ARDrone  in  standard  ROS  style  -  through  messages, 
parameters,  and  services.  All  data  streams  decoded  by  the  SDK,  including  video  and 
navigation  data,  are  converted  into  separate  ROS  topics.  Likewise,  all  parameters  specific 
to  the  ARDrone,  such  as  control  gains,  are  loaded  into  the  ROS  parameter  server  under  the 
driver  node’s  namespace.  In  addition,  the  ARDrone  driver  subscribes  to  a  generic 
command  topic  which  allows  greater  flexibility  in  the  device  used  for  manual  control. 

3. 1.2.2  Multiple  Vehicle  Network. 

Built  into  the  ROS  ARDrone  driver  is  the  capability  to  specify  an  IP  address  of  the 
ARDrone  controlled  during  runtime.  This  capability  enables  control  of  multiple  vehicles 
by  simply  running  two  instances  of  the  ARDrone  driver  with  separate  IP  addresses  and 
forcing  connection  to  a  central  network. 

On  start-up,  each  ARDrone  is  automatically  configured  to  have  a  pre-specified  IP 
address.  To  change  the  IP  and  force  the  ARDrone  to  connect  to  a  managed  network,  an 
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executable  file  was  written  in  its  memory  which  disables  the  old  wireless  link,  configures 
the  new  IP,  and  forces  connection  to  the  router  network.  Once  the  executable  file  is  run, 
each  ARDrone  is  accessible  through  the  router. 

3.1.3  Concept  Diagram. 

Figure  3.2  outlines  the  conceptual  flow  of  the  algorithms  in  this  thesis.  Each  block  in 
the  concept  diagram  represents  an  algorithm  which  takes  specific  inputs  and  transforms 
them  into  an  output.  The  final  output  in  the  concept  diagram  is  the  navigation  solution, 
which  estimates  the  current  state  of  both  vehicles.  The  remainder  of  this  chapter  will 
describe  each  block,  its  inputs,  and  its  outputs  in  detail. 

3.2  Target  Tracking 

The  goal  of  target  tracking  is  to  estimate  the  image  coordinates  of  each  of  the  four 
targets  {s^x ,  s^lx,  e^x,  s%lx}  from  a  trail  camera  image.  The  target  locations  will  be  used  to 
estimate  relative  camera  pose  and  aid  in  feature  correspondence.  Target  tracking  consists 
of  identifying  a  bounding  pixel  region  which  contains  a  target,  determining  the  pixels 
which  belong  to  the  target,  and  approximating  the  center  of  the  target.  Target  tracking 
assumes  that  the  amount  of  motion  between  images  is  small  and  such  that  the  bounding 
region  will  contain  the  target’s  new  location  in  the  next  frame. 

3.2.1  Vehicle  Targets. 

The  vehicle  targets  used  in  the  target  tracking  algorithm  are  ultra-bright  LEDs  housed 
inside  colored  table  tennis  balls.  The  table  tennis  ball  material  acts  as  a  diffuser  to  convert 
a  uni-directional  LED  into  an  omni-directional  light  source.  The  fashioned  targets  are 
shown  in  Figure  3.3.  The  high  intensity  of  the  markers  make  them  easier  to  identify  in  an 
image,  while  also  reducing  the  algorithm’s  sensitivity  to  ambient  lighting. 
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Figure  3.3:  Tracking  Vehicle  Targets.  The  LED  markers  on  the  leading  ARDrone  vehicle 
are  easily  extracted  from  the  surrounding  image.  The  locations  of  each  target  are  found 
by  determining  the  pixels  within  the  color  range  of  a  target  and  finding  its  central  spatial 
moment. 


3.2.2  Tracking  Loops. 

Each  target  on  the  lead  vehicle  is  given  its  own  square  region  of  interest  (ROI),  as  can 
be  seen  in  Figure  3.3.  The  ROI  is  made  small  enough  to  cut  out  as  much  of  the 
surrounding  image  as  possible,  but  large  enough  to  allow  for  movement  of  the  target 
between  frames. 

Each  bounding  ROI  can  be  initialized  from  a  ROS  service  [2]  which  prompts  the  user 
to  left-click  each  target  in  the  correct  order.  The  order  of  the  targets  must  correspond  with 
the  order  of  the  targets’  world  frame  coordinates  in  memory:  starting  from  the  top  left 
target  and  proceeding  counter-clockwise.  The  bounding  ROI  is  computed  for  the  next 
frame  based  on  the  target  area  in  the  current  frame.  As  targets  move  farther  away,  the 
target  area  location  becomes  more  sensitive  to  changes  in  camera  orientation.  Therefore, 
the  computed  ROI  is  proportional  to  the  inverse  area  -  as  the  target  gets  smaller,  the  ROI 
becomes  larger. 
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For  each  target,  its  center  s'"x  is  found  by  processing  its  corresponding  ROI.  First, 
each  pixel  in  the  image  region  is  converted  from  RGB  to  HSV  encoding.  HSV  encoding 
represents  a  pixel’s  color  by  a  three  dimensional  vector:  hue  (H);  saturation  (S);  and  value 
(Y),  or  brightness.  Next,  the  pixels  which  belong  to  the  target  are  determined  by  their 
color.  Finally,  the  center  of  the  target  is  found  by  determining  the  center  of  mass  of  the 
blob  of  pixels  which  belong  to  the  target. 

3.3  Relative  Camera  Pose  Estimation 

Relative  camera  pose  estimation  is  used  to  determine  the  translation  tj  and  orientation 
R'  between  the  leading  and  trailing  vehicles’  cameras  from  the  image  coordinates  of  each 
target  {spxlx,  s%IX,  s^lx,  v},  the  measured  coordinates  of  each  target  in  the  lead  camera 
frame,  and  the  calibration  parameters  for  the  trail  camera.  The  vector  t\  represents  the 
relative  position  from  the  origin  of  the  trail  camera  frame  to  the  origin  of  the  lead  camera, 
expressed  in  the  trail  camera  frame.  The  matrix  Rj  is  a  DCM  which  transforms 
coordinates  in  the  trail  camera  frame  to  the  lead  camera  frame. 

The  locations  of  each  target  in  the  trail  vehicle’s  image  are  matched  up  to  the  true 
target  locations  expressed  in  the  lead  vehicle’s  camera  frame,  which  have  been  determined 
ahead  of  time.  Each  imaged  target  s'"'1 ,  true  target  location,  and  the  calibration  parameters 
of  the  trail  camera  can  be  used  to  form  the  equations  shown  in  Section  2.10.1,  where  the  x 
and  y  pixel  locations  of  the  imaged  target  correspond  to  the  variables  u  and  v  respectively. 
The  equations  for  the  four  targets  form  a  overdetermined  PnP  problem  which  can  be 
solved  using  non-linear  techniques. 

The  non-linear  solver  used  in  this  work  is  OpenCV’s  solvePnP  function.  The 
solvePnP  function  can  be  configured  to  run  in  three  modes  of  operation:  CV_P3P,  which 
runs  the  algorithm  in  [26];  CV2EPNP,  which  runs  the  algorithm  found  in  [39];  or 
CV.ITERATIVE,  which  is  a  custom  Levenburg-Marquardt  optimization  routine.  The 
function  can  also  be  used  in  a  custom  RANSAC  mode,  where  any  of  the  above  three 
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methods  can  be  used  as  the  PnP  solver  [10].  For  this  work,  the  CV.ITERATIVE  mode  will 
be  used. 

3.4  Feature  Correspondence 

The  goal  of  feature  correspondence  is  to  find  a  set  of  vectors  f;  which  represent  the 
pixel  coordinates  corresponding  to  a  single  feature  identified  in  four  images.  The  j-th 
feature  vector  can  be  written  as 


f j 


xf(k)  xjf(fc)  xjf(k-l)  x%*(k-  1) 


(3.1) 


where  /  and  t  denote  the  lead  and  trail  pixel  frames,  k  is  the  current  time  epoch,  and  k—  1 
is  the  previous  time  epoch.  Feature  correspondence  involves  feature  extraction  (i.e. 
detection),  description,  and  matching,  as  detailed  in  Section  2.5. 

Initial  feature  correspondence  determines  matches  between  the  lead  and  trail  camera 
images  at  the  same  time  epoch.  It  is  performed  by  OpenCV’s  feature  detector,  descriptor 
extractor,  and  descriptor  matcher  objects.  OpenCV  offers  a  number  of  traditional  methods, 
such  as  SIFT  [43]  and  SURF  [6],  as  well  as  some  of  the  newly  developed  feature  detection 
algorithms,  such  as  ORB  [55],  and  binary  descriptor  extractors,  such  as  BRISK  [40]  and 
FREAK  [4].  The  matching  algorithms  include:  brute  force,  where  matches  are  determined 
strictly  by  descriptor  distance;  and  k-nearest-neighbor,  where  each  descriptor  may  have  up 
to  k  matches.  The  detector,  descriptor,  and  matcher  combination  used  in  this  research  will 
be  discussed  in  Chapter  4. 

After  initial  feature  correspondence,  the  raw  matches  are  filtered  using  an  estimate  of 
the  fundamental  matrix  F,  computed  by  the  equation 


f  =  (TfrrE(T  rr1 


(3.2) 
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where  T’’lx  and  Tpclx  are  the  camera  calibration  matrices  for  the  lead  and  trail  vehicles, 
respectively,  and  E  is  computed  using  t\  and  Rj  as  shown  in  equation  2.27.  Each  pairing 
[x^fr),  xr’lx(T)}  is  validated  by  the  constraint  equation 

[xf  f  (r)Fxf  (r)  <  a  (3.3) 

where  xp"  denotes  pixels  in  homogeneous  coordinates  and  a  is  a  threshold  below  which 
points  are  considered  valid. 

To  determine  matches  across  time,  an  initial  set  of  matches  is  determined  by 
matching  the  descriptors  for  the  points  extracted  from  the  leading  camera  at  times  k  and 
k  -  1.  A  homography  is  then  computed  with  this  initial  set  of  matches  in  a  RANSAC 
approach  [10]  to  get  rid  of  erroneous  matches.  Now  each  identified  feature  has  image 
coordinates  in  four  images.  Figure  3.4  shows  an  example  of  the  results  of  feature 
correspondence.  It  can  be  verified  by  eye  that  the  matches  corresponding  to  a  single 
feature  are  consistent  across  cameras  and  across  time  epochs. 

3.4.1  Image  Masking. 

An  image  mask  is  a  binary  matrix  of  the  same  size  and  shape  as  its  corresponding 
image  which  indicates  the  regions  in  the  image  that  should  be  processed  for  features. 
Image  masks  were  used  to  mitigate  erroneous  feature  matches  in  the  feature 
correspondence  algorithm  and  to  prevent  feature  matches  which  would  not  be  localized 
well  by  the  triangulation  algorithm.  In  a  hallway,  features  tend  to  be  densely  packed  on 
the  outsides  of  the  image:  e.g.  pictures,  doors,  fountains,  etc.  Features  at  the  end  of  the 
hallway  tend  to  be  harder  to  localize.  In  addition,  for  the  trail  vehicle,  part  of  its  view  does 
not  intersect  the  lead  vehicle’s  view. 

Constructing  the  image  mask  for  each  image  takes  into  account  the  estimated 
vanishing  point  for  each  image  and  the  imaged  vehicle  target  coordinates,  in  the  case  of 
the  trail  camera  image.  The  end  of  the  hallway  can  be  estimated  as  a  region  surrounding 
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Figure  3.4:  Results  of  Feature  Correspondence.  The  image  is  a  stitching  of  the  images  of 
the  leading  (left)  and  trailing  (right)  vehicles  at  the  current  time  (top)  and  the  previous  time 
(bottom)  epochs. Green  lines  indicate  a  match  across  cameras.  Red  lines  indicate  a  match 
across  time  epochs. 


the  vanishing  point  in  each  image.  The  section  of  the  trail  camera  image  which  is  not  in 
the  lead  camera’s  view  can  be  estimated  from  the  first  and  fourth  vehicle  target  (refer  to 
Figure  3.3).  The  image  area  below  these  targets  is  not  in  view  of  the  lead  vehicle.  Figure 
3.5  shows  the  result  of  applying  the  constructed  image  masks  to  the  lead  and  trail  camera 
images. 


Figure  3.5:  Image  Processing  Using  Masks.  After  applying  a  mask  to  an  image,  the  areas 
that  are  not  of  interest  are  black.  Only  the  areas  outside  the  mask  will  be  processed  for 
features  to  match 
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The  benefit  of  using  image  masks  is  that  it  increases  the  probability  of  finding  valid 
feature  matches  that  will  be  well  localized  in  the  triangulation  algorithm  which  follows 
(see  Figure  3.2).  The  image  masks  ensure  that  the  putative  set  of  matches  determined  by 
the  feature  matching  objects  are  more  consistent  with  the  overlap  between  the  two  images 
and  eliminate  some  false  matches  before  attempting  to  match  across  time  epochs. 
Eliminating  false  matches  during  feature  correspondence  is  preferred  over  eliminating 
false  matches  in  later  processes  (triangulation  and/or  batch  processing).  It  is  possible  to 
have  a  match  which  is  not  consistent  with  the  image  overlap,  but  satisfies  the  epipolar 
constraint  and  appears  well  localized.  These  types  of  matches  are  much  more  difficult  to 
eliminate  after  the  feature  matching  process,  which  makes  computing  the  image  masks  an 
important  step  in  feature  correspondence. 

3.5  Feature  Triangulation 

Feature  triangulation  uses  each  pair  of  lead  and  trail  pixel  coordinates  contained  in 
each  vector  f;,  and  the  relative  translation  t\  and  relative  orientation  DCM  R'  (the 
transpose  of  R|)  for  each  time  epoch  to  determine  the  coordinates  of  each  feature  M ;  in 
four  frames  of  reference.  These  four  frames  of  reference  are  the  lead  and  trail  camera 
frames  for  times  k  and  k—  1 .  While  the  coordinates  of  a  feature  referenced  to  the  trail 
vehicle  at  time  k—  1  will  be  the  only  coordinates  used  in  the  next  block,  batch  processing, 
triangulating  features  in  all  frames  of  references  is  another  means  of  eliminating 
erroneous  feature  correspondences. 

In  the  ideal  case,  any  two  matching  image  points  would  correspond  to  two  camera 
frame  pointing  vectors  which  intersect  at  a  distinct  point  in  3-space.  However,  due  to 
errors  in  feature  correspondence  and  relative  camera  pose  estimation,  this  is  hardly  the 
case.  Additionally,  the  feature  correspondence  algorithm  detailed  in  section  3.4  will 
occasionally  permit  false  matches. 
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3.5.1  Linear  Triangulation . 

While  optimal  algorithms  exist  for  finding  the  most  probable  point  in  3-space,  the 
triangulation  method  developed  in  this  section  remains  linear.  The  algorithm  starts  with 
two  matched  pointing  vectors  in  the  lead  and  trail  camera  frames  U/,  u,  which  are  found  by 
applying  the  transform 


i!  =  (Tf  )rxf  (3.4) 

where  x^x  is  the  pixel  frame  location  of  the  j-th  imaged  feature  from  camera  i  in 
homogeneous  coordinates.  Equation  3.4  is  essentially  a  transform  from  the  pixel  frame  to 
the  camera  frame  without  applying  depth.  Let  the  epipolar  plane  II  be  spanned  by  ut,  t\ 
and  take  the  trail  camera  frame  as  the  frame  of  reference.  The  projection  of  Ui  onto  the 
epipolar  plane  is  then  [60] 


n  = 


u 


u;  =  (irnpirifyi, 


(3.5) 


Now  the  three  vectors  u,,  u',  it  lie  in  a  plane  and  satisfy  the  linear  equation 


st  ■  uf  —  si  ■  U;  =  t't  (3.6) 

where  s /  and  st  are  the  estimated  depths  to  the  triangulated  point  in  the  lead  and  trail 
camera  frames  respectively.  Equation  3.6  is  an  overdetermined  system  with  one  exact 
solution  which  is  easily  solved  by  QR  decomposition  [60].  Solutions  which  return 
negative  can  be  discarded  as  they  are  not  feasible  (i.e.  a  feature  cannot  be  behind  the 
camera). 
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3.5.2  Measures  of  Accuracy. 

False  matches  can  be  removed  during  triangulation  by  thresholding  the  projection 
error  and  parallax  angle  of  the  corresponding  feature  points.  The  projection  error 
associated  with  triangulation  is  calculated  by  [60] 

e  =  ll(i  -  (nrn)_1nr)Rju/||  (3.7) 

which  determines  how  much  the  pointing  vectors  needed  to  be  adjusted  to  force  their 
intersection.  It  was  empirically  determined  that  feature  correspondences  with  more  that 
0.05  units  projection  error  could  be  considered  erroneous.  Parallax  angle  refers  to  the 
angle  between  the  two  pointing  vectors  for  a  single  feature.  After  ur  is  projected  onto  the 
epipolar  plane,  the  parallax  angle  can  be  estimated  by 

6  =  arcsin(||u,  x  U;||)  (3.8) 

Parallax  angle  will  be  further  investigated  in  Chapter  4. 

3.6  Batch  Processing 

The  purpose  of  the  batch  processing  algorithm  is  to  create  a  single  measurement  from 
the  outputs  of  feature  correspondence,  relative  camera  pose  estimation,  and  feature 
triangulation,  as  shown  in  Figure  3.2,  which  can  be  fed  directly  into  a  Kalman  filter.  The 
batch  processor  designed  in  this  section  is  an  iterative  non-linear  model  similar  to  the 
construction  of  the  bundle  adjustment  problem  in  Section  2.10.4.  The  advantage  of  the 
batch  processor  over  handling  each  input  individually  is  that  by  incorporating  all  inputs  at 
once,  outliers  can  be  handled  before  filtering.  In  addition,  fewer  measurements  must  be 
eventually  processed  by  the  Kalman  filter.  The  obvious  disadvantage  is  the  processor’s 
iterative  solution. 
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A  frame  of  reference  for  the  batch  processor  must  first  be  defined.  Since  the 
information  given  to  the  batch  processor  will  be  used  to  set  up  equations  relating  the  lead 
and  trail  vehicle  positions  between  times  k  and  k  -  1,  a  local  frame  of  reference  must  be 
chosen.  The  body  frame  of  the  trail  vehicle  at  time  k  -  1  will  be  the  reference  frame  for 
the  batch  processor  and  will  be  denoted  by  a  superscript  w.  The  body  frame  for  the  other 
three  vehicle  positions  in  the  measurement  processing  algorithm  can  be  related  to  the 
measurement  processor  frame  by  the  equations 


pw  r^bt(k- 1) 


r^w  _  r^wr^n 

—  K'ny'bi(T) 


(3.9) 


where  Z?,(r)  denotes  the  body  frame  of  vehicle  i  at  time  r.  Each  DCM  on  the  right  side  of 
Equation  3.9  is  computed  using  the  estimated  roll  and  pitch  0,(r)  angles  for  each 
vehicle  at  time  r,  which  comes  from  each  vehicle’s  navigation  data  stream,  as  well  as  the 
corrected  yaw  angle  i/r,(r)  for  each  vehicle  at  time  r,  which  comes  from  the  vanishing 
point  algorithm. 

Each  quantity  is  related  to  the  state  variables  by  an  equation  of  the  form 


z(t)  =  h(pM(r),  d¥(r),  MH)  +  v(r)  (3.10) 

where  z(t)  is  a  realized  measurement,  h(p"'(T),  d*P(r),  MH  )  is  a  predicted  measurement, 
pw(r),  dT(r),  M"  are  the  position(s),  attitude  error(s),  and  feature  coordinates  respectively 
and  v(r)  is  zero-mean  white  Gaussian  noise.  The  noise  strength  R  is  defined  for  each 
input  as 


R  =  E[v(r)vr(r)]  (3.11) 

where  r  denotes  time. 
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3.6.1  Feature  Points . 


The  four  image  coordinates  contained  in  each  f;  (Section  3.4)  are  assumed 
independent  and  can  be  handled  individually.  Therefore,  each  feature  point  can  be 
expressed  as 


x^(r)  =  V(p  J(r),  dT,(r),  MJ)  +  V/(r)  (3.12) 

where  hf(pu'(r),  c)*P,(t),  Mlv),  xpix(t)  are  the  predicted  and  realized  image  coordinates  of 

J  l  J  IJ 

the  j- th  three-dimensional  feature 


MJ  = 


jyw  yvv  ^\v 


(3.13) 


for  the  /'-th  vehicle  at  time  r.  The  predicted  image  coordinates  h/(pj(r),  £*F,(t),  M"')  can 
be  calculated  by  the  matrix  equations  [66] 


s0'(r)  =  C£'C^(T)(MJ  -  p-'(r)) 

h/(pr(r),  SViir),  MJ)  =  — i—  Tf  sg(r)  (3.14) 

LSijVTiJz 

where  sfj(r)  is  the  pointing  vector  in  the  /'-th  vehicle’s  camera  frame  and  pj(r)  is  the 
position  of  the  /-th  vehicle  at  time  r  in  the  measurement  processor  frame  w.  Note  that  the 
body  to  camera  frame  DCM  for  each  vehicle  CJ  does  not  change  over  time,  hence  the 
absence  of  r.  The  subscript  z  ([•]-)  denotes  the  z  component  of  the  sj(r)  vector.  The 
corrected  w- frame  to  body  frame  DCM  CfJTI  is  computed  using  the  attitude  errors  for 
vehicle  i  at  time  r  via  the  equation 

Cj(T)  =  Cjfr)[I  +  dT((r)x]  (3.15) 
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where  5*P,(r)  is  a  vector  of  the  attitude  errors  for  vehicle  i  at  time  r.  The  influence 
matrices  associated  with  h/'(p“’(T),  <SYj(T),  MJ)  are  found  using  chain  rule  derivatives  as 
[66] 


dhf(p  ”(t),<^(t),MJ) 
dp  {(t) 

dSJ(T)  =  _CCtbM 

dp i(T)  b<  w 

W) 


J pix  . 


.  asc,(r)  .  asCi(r) 


-  (9pi(r) 


[sy  (r)] 


^(r) 


_  r.  ds  .‘(r) 

_  [S/(T)]^(,5T,(r))  "  Si^T^aOT,(r))]z 


[s}'(t)] 


5sc/(r) 

7  -  -c;;;ct(T,[(Mj  -  prolix 


H Mjfjij)  - 


d(.6V,iT))  ~bi  w 

dhfWiTXSYiiTXMp 


dM'J 

dscl(T) 

J  _  r^ci  r'bi(j) 

5M'!'  ~  bi  w 


_  rppiX 

~  1  Ci 


r  ds‘(r)  r  SsHt) 

[s  /  (T)]z  s  /  (T)  L  ~ik^  1 


dMJ 


(3.16) 


(3.17) 


(3.18) 


Once  again,  the  subscript  z  ([•]')  denotes  the  z  component  of  the  vector.  The  error 
covariance  for  each  feature  point  in  units  of  pixels2  is 


R/  = 


4  0 
0  4 


(3.19) 


The  error  covariance  values  were  chosen  arbitrarily,  but  were  found  not  to  have  a 
significant  impact  on  solution  accuracy  if  changed.  This  is  because  of  the  high  accuracy  of 
the  relative  camera  position  estimates,  which  are  given  a  significantly  higher  weight  in  the 
batch  processing  algorithm  such  that  the  feature  point  error  covariance  becomes  arbitrary. 

3.6.2  Relative  Camera  Position  . 

The  relative  camera  position  estimates  only  take  into  account  the  relative  translation 
between  the  lead  and  trail  cameras  at  time  r.  Therefore,  each  relative  camera  position  can 
be  expressed  in  meters  as 
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t|(r)  =  hA(p?'(T),pr(T),(5V,(T),<yTf(T))  +  Va(t) 


(3.20) 


where  tj(r)  comes  from  relative  camera  pose  estimation  (Section  3.3).  The  predicted 
relative  camera  position  estimate  hA(p”'(r),  p"'(r),  c)'F/(t),  <5T,(t))  is  calculated  from  the 
positions  and  attitude  errors  of  both  vehicles  by  the  equation 

hA(p;i'(r),  Pr(r),  d'f*/(r),  SVt(r))  =  Cc‘tb^\p^(T)  -  p^r))  (3.21) 

where  Cf/r)  was  defined  in  Equation  3.15  and  C£'  will  be  described  in  Section  3.8.2.  The 
vectors  p and  p)v(r)  represent  the  position  of  the  lead  and  trail  vehicles  respectively  at 
time  r  expressed  in  the  batch  processor  frame.  This  equation  is  nonlinear  with  respect  to 
the  attitude  errors  d*F/(r),  d'F;(r).  The  influence  matrices  associated  with 
hA(pJv(r),  p)'  (r),  ()'F/(r),  civF,(T))  are  found  via  chain  rule  as 


tt  _  ahA(pr(T),pr(r),tfTKr),^(T)) 

pi aKT)  ^  .  s  —  '-'u 


5pi(r) 


'b,  w 


Hja(t)  =  ^hA(pr(r),pr(r  ),dY,(r),fflFf(r))  = 


(t) 


'b,  w 


dhA(p7(r),  p^(r),  SVt(r)) 

W)  - - d(6Vt(T)) - _  Cb,C™  [p' (r)  "  Pf  (t)]> 


H'f,a(t)  -  0 


(3.22) 

(3.23) 

(3.24) 

(3.25) 


The  error  covariance  for  each  relative  camera  position  estimate  RA  will  be  characterized  in 
the  next  chapter. 

3.6.3  Batch  Solver. 

The  state  of  the  batch  estimator  contains  the  position  and  attitude  errors  for  the  lead 
and  trail  vehicle  at  the  current  time  k-,  the  position  and  attitude  errors  of  the  lead  vehicle  at 
the  previous  time  k—  1;  and  the  features  such  that  the  state  can  be  written  as 
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X  = 


(3.26) 


p  J(k)  5'¥l(k)  pm  d'Yi(k)  pJT(A:  -  1)  5V,(k-  1)  M“’  ...  MJ 

The  body  frame  of  the  trail  vehicle  at  time  k  -  1  is  the  reference  frame  for  the  state  of  the 
batch  estimator  and  the  position  of  the  trail  vehicle  at  time  k—  1  is  the  origin.  The  batch 
estimator  seeks  a  solution  vector  x  which  maximizes  the  distribution  [28] 


p(x\z)  =  (27 r)  2£z2exp  (~[f(x)  -  z]rEz  1  [f (x)  -  z] 


where  z  is  a  vector  of  relative  camera  positions  and  feature  points 


t \(k) 
t\(k  -  1) 


z  = 


<x(k) 


1) 


f(x)  is  a  vector  of  the  predicted  relative  camera  positions  and  feature  points 


(3.27) 


(3.28) 


f(x) 


hA(p7(k),pm,6'ill(k),6'¥t(k)) 
hA(P;i'(k  -  1),  p ?(k  -  1),  5V,(k  -  1),  5^t(k  -  1)) 
h/(pf(r),  dT/(r),  M”) 


(3.29) 


and  Ez  is  a  Gaussian  covariance  representing  the  error  between  z  and  f(x)  and  is  formed  by 
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£z  = 


(3.30) 


R/ 


The  solution  vector  can  be  expressed  as  a  perturbation  Ax  about  a  nominal  state  x0 


x  =  x0  +  Ax  (3.31) 

Using  equation  3.31,  the  nonlinear  function  f(x)  can  be  expanded  using  the  Taylor  series 
to  form 


f(x)  =  f(x0)  +  JAx  (3.32) 

where  J  is  the  Jacobian  of  f(x)  evaluated  at  the  nominal  state.  The  Jacobian  is  constructed 
in  pieces  using  the  influence  matrices  outlined  in  Sections  3.6.1  and  3.6.2.  As  an  example, 
for  two  relative  camera  position  estimates  (for  time  k  and  k—  1)  and  a  single  feature,  the 
Jacobian  has  the  form 

Hwa(&)  0  UPtA(k)  H,,A(k)  0  0  0 

0  0  0  0  H piA(k  -1)0  0 

j  Hw/n(fc)  Hip  ,fn(k)  0  0  0  0  HM]jn(k) 

o  0  H  Ptftl(k)  Hip  ,fn(k)  0  0  Hv/,/;,  (k) 

0  0  0  0  Hpifn(k  -  l)  Hip;y(] (/:  -  1 )  HMlfn(k  -  1) 

0  0  0  0  0  0  HMl/tl(fc-l) 

(3.33) 

Using  Least  Squares  techniques,  equation  3.32  can  be  rearranged  to  form  the  normal 
equations  [60] 
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(Jr2:z-1J)Ax  =  Jr2^(z  -  f(x))  (3.34) 

and  the  system  of  equations  can  be  solved  to  find  Ax.  At  each  iteration,  the  resulting  Ax  is 
added  to  x  until  the  vector  converges,  at  which  point  the  distribution  in  Equation  3.27  is 


maximized. 


3.6.3. 1  Levenburg-Marquardt  Solution. 


The  Levenburg-Marquardt  method  introduces  an  augmentation  to  the  normal 
equations,  so  that  they  become  [28] 


(3.35) 


and  the  system  is  solved  in  same  way  as  before.  The  key  to  the  Levenburg-Marquardt 
method  is  in  the  update  of  the  A  parameter.  Although  there  are  many  argued  best  methods 
for  updating  this  parameter,  the  method  chosen  in  this  work  is  simple,  and  follows  the 
method  outlined  in  [28].  If  Ax  increases  with  iteration,  increase  A  by  a  factor  of  10  and 
recompute  Ax  without  recomputing  the  Jacobian  J  and  prediction  f(x).  If  Ax  decreases 
with  iteration,  reduce  A  by  a  factor  of  10  and  continue. 


3. 6.3. 2  Robust  Cost  Functions. 


Built  into  the  Levenburg-Marquardt  method  is  the  minimization  of  the  squared  norm 
of  the  residuals 


Si  =  Z i  -  f;(x) 


(3.36) 


where  d,  is  the  residual  for  the  z'-th  measurement  z(.  Minimizing  the  squared  norm  is 
sufficient  when  the  data  is  outlier- free.  However,  the  feature  correspondence  algorithm 
developed  in  Section  3.4  and  triangulation  algorithm  developed  in  Section  3.5  will 
occasionally  forward  large  outliers  to  the  batch  processor.  It  is  therefore  necessary  to 
weight  the  residual  vector  using  a  cost  function  which  is  more  robust  to  outliers.  The  cost 
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function  used  in  this  thesis  is  the  Huber  cost  function,  which  de- weights  the  cost  of  outlier 
residuals  according  to  the  equation  [28] 


COW  = 


<5?  :  m  <  b 

2b\dj\  -  b 2  :  otherwise 


(3.37) 


where  b  is  approximately  the  outlier  threshold. 

The  cost  function  can  be  integrated  into  the  Levenbur-Marquardt  algorithm  by 
weighting  the  residuals  for  each  measurement  by  [28] 


C(||d,]|)(1/2) 


S'  =  WjSj  (3.38) 

and  the  new  residual  vector  is  formed  by  concatenating  all  of  the  6” s. 

3.6.4  Kalman  Filter  Measurement  Formation  . 

The  result  of  the  batch  processor  at  this  point  are  a  nominal  state  x  and  the  Jacobian  J 
evaluated  at  the  nominal  state.  The  first  order  approximation  for  the  covariance  of  the 
nominal  state  is  simply  [28] 


Pxx  =  (JrEz-|J)t  (3.39) 

where  t  denotes  the  pseudo-inverse  in  the  case  that  JrE“ 1 J  is  not  invertible. 

The  measurement  forwarded  to  the  Kalman  filter  is  determined  by  a  linear 
transformation  of  the  nominal  state  x,  excluding  the  feature  states 
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P  ?(k) 

1  BATCH  =  p ™(k)  ~  Tx 


(3.40) 


prc*-D 


I  0  0  0  0  0 


T=  0  0  I  0  0  0 


(3.41) 


0  0  0  0  1  0 


Similarly,  the  covariance  of  the  state  Pxx,  excluding  all  terms  corresponding  to  the  feature 
states,  can  be  transformed  by  the  equation 


(3.42) 


and  the  measurement  can  now  be  fed  into  the  Kalman  filter. 

3.7  Navigation  Kalman  Filter 

The  navigation  Kalman  filter  developed  in  this  work  fuses  measurements  from  the 
batch  processor  and  vanishing  point  algorithm  with  a  linear  system  model  in  order  to 
produce  an  estimate  of  the  position  of  both  vehicles  in  the  navigation  frame.  The  roll  and 
pitch  angles  0,-(r),  0,-(r),  and  the  corrected  yaw  angle  iA,(r)  are  used  to  rotate  the 
measurement  from  the  batch  processing  algorithm  into  the  navigation  frame. 

3. 7. 1  System  Model. 

The  system  model  for  the  two  vehicle  network  is  a  linear  stochastic  differential 
equation  of  the  form 


x(t)  =  F  (t)x(t)  +  G(f)w(f) 


(3.43) 


where  x(t)  is  the  system  state,  w(/)  is  zero-mean  white  Gaussian  noise,  and  F(f),  G(l)  are 
the  state  dynamics  and  noise  matrices  which  characterize  the  relationships  between  the 
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system  state  and  noise  processes.  As  mentioned  in  [18],  the  ARDrone’s  internal 
stabilization  algorithm  makes  capturing  the  vehicle’s  dynamics  difficult  and  thus  an 
appropriate  approximation  is  a  constant  velocity  model. 

The  state  of  a  single  vehicle  z  is  represented  by  the  vector 


x«(0  = 


X” 


Y i  ZT  vx,  V"  VI  6^  Sh 


(3.44) 


where  (X",  Y",  Z")  is  the  position  of  the  z-th  vehicle  in  the  navigation  frame,  (V",  V",  V") 
is  the  velocity  of  the  z'-th  vehicle  in  the  navigation  frame,  and  6ij/j  are  the  heading  error 
and  heading  error  rate  for  the  z'-th  vehicle.  The  pitch  and  roll  errors,  66,  6<p,  are  not 
modelled  as  they  are  not  observable  from  the  measurements.  The  state  dynamics  matrix 
for  the  z-th  vehicle 


0  0  0  1  0  0  0  0 
0  0  0  0  1  0  0  0 
0  0  0  0  0  1  0  0 


Fz(0 


00000000 

00000000 


00000000 


0  0  0  0  0  0  0  1 


00000000 


(3.45) 


characterizes  the  constant  velocity  nature  of  the  system  model  by  setting  the  derivatives  of 
the  velocity  states  to  zero. 

Noise  is  allowed  to  propagate  through  the  system  by  entering  the  velocity,  heading 
error,  and  heading  error  rate  states.  The  noise  processes  are  assumed  to  be  independent  of 
each  other,  leading  to  the  noise  vector  for  the  z-th  vehicle 


w,(0 


T 


WVx  WVy  Wyz  W6l/l  Wgj, 


(3.46) 
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It  follows  that  the  noise  matrix  is  then 


0  0  0  0  0 
0  0  0  0  0 
0  0  0  0  0 


G  i(t) 


1  0  0  0  0 
0  10  0  0 


0  0  10  0 


0  0  0  1  0 
0  0  0  0  1 


(3.47) 


Assuming  that  the  states  and  noise  processes  for  each  vehicle  are  also  independent  from 
one  another,  the  complete  system  model  for  the  two  vehicle  network  is 


x,(t) 

f  m 

0 

x,(t) 

G  i(t) 

0 

w  ,(t) 

— 

+ 

xt(t) 

0 

Ff(0 

x,(t) 

0 

Gf  {t) 

w  ,{t) 

where  the  subscripts  l  and  t  indicate  variables  for  the  lead  and  trail  vehicles  as  defined  in 
Equations  3.44  -  3.47.  The  noise  strength  for  each  vehicle  is  defined  by 


Q/(0  = 


0.0625  0  0 

0  0.0625  0 

0  0  0.0625 


0 

0 


0 

0 


0 

0 


0  0 

0  0 

0  0 

0.001  0 

0  0.01 


(3.49) 


which  also  indicates  that  the  noise  variables  are  assumed  independent  from  each  other. 
The  non-zero  elements  of  Q,(t)  were  chosen  as  a  starting  point  for  tuning  the  filter  and 
will  be  investigated  in  the  next  chapter.  They  were  selected  because  they  were  the  initial 
values  of  the  noise  strength  matrix  in  [18]. 
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3.7.2  Vanishing  Points. 

Vanishing  point  measurements  are  a  scalar  which  directly  measure  the  heading  error 
state  at  time  4 .  The  measurement  can  be  simply  expressed  as 


6^EAS(tk)  =  Mv(4))  +  v^(4)  (3.50) 

htyfa)  =  6i/fi(tk)  (3.51) 

where  5ipfEAS  (tk)  is  the  heading  error  reported  by  the  vanishing  point  estimation 
algorithm  and  dij/j  is  simply  the  heading  error  state  of  the  z-th  vehicle,  as  shown  in 
Equation  3.44.  The  influence  matrix  associated  with  /z^(x,(4))  is  simply 


dhg^iXi) 

M  =  3(^(4))  = 

The  measurement  error  covariance  for  each  vanishing  point  measurement  is 


(3.52) 


Re#  =  0.04  (3.53) 

in  units  of  rad2.  The  covariance  value  was  chosen  to  be  the  same  value  as  used  in  [18]. 
3.7.3  Delayed  State  Extended  Kalman  Filter  Update. 

Using  the  delayed  state  algorithm  described  in  section  2.7.2,  the  measurement 
described  in  Section  3.6.4  and  the  vanishing  point  measurements  described  in  the  previous 
section  can  be  integrated  into  the  Kalman  filter.  First,  the  nonlinear  measurement  equation 
is 


z  =  h(x(4),x(4_0)  +  v 
where  z  is  a  vector  of  measurements 


^  (4)  ^  (4) 


L BATCH 


(3.54) 


(3.55) 
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and 


h(x(4),x(4_i)) 


8*ffi(tk) 

SifJiih) 

CfM)C”(p"(r,)-p«fe_i)) 
CfM)C”(p?(4)-p?(4-i)) 
Cf  ' :-l)C”(p”(4-i)  -  p"(4-i)) 


(3.56) 


where  p"(4),  p"(4-i),p"(4),  p"(4_0  are  the  positions  at  time  4  and  4_i  respectively  and 
C~'(4  l)  is  the  transpose  of  the  body  to  nav  DCM  computed  by  Equation  2.4  using  the  body 
angles  0f(4_i),  0,(4- 1 ),  iA,(4-i)-  The  DCM  C”  is  a  function  of  the  trail  vehicle  heading 
error  at  time  4_i  and  can  be  calculated  from 
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h 
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cos(d<M4-i))  -  sin(diAr(4-i))  0 

sin(d<M4_i))  cos(diM4-i))  0 

0  0  1 


Equation  3.56  can  then  be  linearized  to  form  the  influence  matrices 
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Al  =  C^l)7^\C> l(tk)  -  Pr(4-l)) 

C\ 

A2  =  -  p?(4-0) 

d{8ip) 

Aa  =  ^^(p^O  -  P,(4-i)) 
where  ^-C"  can  be  calculated  as 

-sin(d(Af(4-i)  -cos(d^(4_i)  0 
cos(di/r,(4_i)  -  sin(4iA;(4~i)  0 
0  0  0 
The  measurement  covariance  for  the  delayed  state  update  is  nominally 


(3.60) 


R<5i//  o  o 

R  =  0  Rty  0  (3.61) 

0  0  R  BATCH 

in  the  case  where  the  vanishing  point  estimation  algorithm  returns  two  vanishing  point 
measurements  and  the  batch  processing  algorithm  is  able  to  produce  a  single 
measurement. 


3.8  Procedures 

This  section  describes  a  number  of  necessary  calibration  steps.  Where  possible,  these 
procedures  were  automated  in  the  ROS  framework. 

3.8.1  Camera  Calibration. 

Camera  calibration  was  conducted  prior  to  each  flight  using  a  publicly  available  ROS 
camera  calibration  package.  The  camera  calibration  package  uses  OpenCV  camera 
calibration  code,  which  is  based  on  the  algorithms  found  in  [9]  and  [69].  The  camera 
calibration  package  subscribes  to  a  topic  containing  the  image  stream  of  the  camera  to  be 
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calibrated.  The  image  stream  is  processed  using  an  OpenCV  function  which  automatically 
detects  a  chessboard  pattern  and  estimates  the  inner  comers.  Figure  3.6  is  a  screen  capture 
of  the  ROS  package  capturing  a  chessboard  in  an  image.  The  camera  calibration  package 
can  determine  when  there  is  enough  image  data  to  constrain  the  calibration  equations  and 
return  valid  calibration  parameters. 


Figure  3.6:  Automatic  Camera  Calibration.  The  ROS  camera  calibration  package 
automates  camera  calibration  by  recognizing  a  calibration  pattern. 


3.8.2  Tilt  Calibration  . 

In  order  to  keep  the  lead  vehicle  in  its  field  of  view,  a  tilt  angle  was  added  to  the  trail 
camera.  The  tilt  needs  to  be  calibrated  to  improve  the  accuracy  of  the  image  processing 
and  filtering  algorithms.  Like  camera  calibration,  tilt  calibration  uses  the  calibration  board 
shown  in  Figure  3.6.  The  chessboard  auto-detection  will  detect  the  same  10x10  grid  of 
points  -  shown  as  colored  circles  -  in  each  image.  The  grid  points  belong  to  an  object 
coordinate  frame  where  the  x-axis  runs  left  to  right,  the  y-axis  runs  top  to  bottom,  and  the 
z-axis  points  towards  the  wall.  The  object  coordinates  of  each  grid  point  are  then 


P  = 


xc  -/•(/-  1 )  yc-/-(j-l)  0 


T 


(3.62) 
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where  i  is  the  column  number,  j  is  the  row  number,  /  is  the  length  of  each  square  in  the 
chessboard  pattern,  and  (jcc,  yc)  is  the  center  of  the  grid. 

Using  the  camera  matrix  and  distortion  coefficients  found  from  camera  calibration, 
the  grid  point  coordinates,  and  the  image  location  of  each  grid  point,  OpenCV’s 
solvePnP  function  can  be  used  to  the  find  the  relative  orientation  between  the  camera 
frame  and  the  calibration  board  frame.  By  placing  the  trail  vehicle  on  a  level  surface  and 
ensuring  the  calibration  board  is  perpendicular  to  the  floor  and  roughly  centered  in  the 
image,  the  relative  orientations  correspond  to  the  tilt  angles  of  the  trail  camera.  These 
angles  are  computed  for  each  image  in  the  calibration  sequence  and  averaged. 

The  rotations  about  the  y  and  z  axes  were  found  to  be  negligible;  roughly  0°  for  both 
axes.  However,  the  rotation  a  about  the  x  axis,  which  is  the  axis  about  which  the  camera 
was  rotated,  is  of  interest.  After  calibration  of  a  the  DCM  relating  the  body  frame  and 
camera  frame  of  the  trail  vehicle  Cb‘  can  be  calculated  by  the  equation 


cos  (or) 

0 

sin(cr) 

0 

1 

0 

r^b 

(3.63) 

-  sinter) 

0 

cos(a) 

where  C1’.  is  equivalent  to  the  body  to  camera  DCM  described  in  section  2.4.  The 
calibrated  tilt  angle  a  was  found  to  be  -1 1.5°.  The  DCM  for  relating  the  body  and  camera 
frame  of  the  lead  vehicle  C*'  is  simply  C*. 

3.8.3  Targeting  Calibration. 

Although  the  backlit  targets  are  less  sensitive  to  changes  in  lighting  conditions  during 
a  data  collection,  the  color  ranges  of  the  targets  on  the  lead  vehicle  need  to  be  calibrated 
before  each  collection  to  ensure  that  the  targets  can  be  reliably  tracked.  This  is  especially 
important  when  operating  in  different  environments  (e.g  the  Vicon  chamber  v.  the 
hallway).  Figure  3.7  shows  the  target  calibration  GUI  designed  to  calculate  the  necessary 
HSV  ranges  and  blur  parameters  for  target  tracking. 
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Figure  3.7:  Calibration  of  the  Target  Detector.  The  GUI  designed  allows  the  user  to 
manipulate  the  HSV  ranges  and  blur  parameters  using  OpenCV  trackbars.  The  parameters 
should  be  adjusted  so  that  targets  are  clearly  distinguishable  in  the  binary  image  (bottom 
left). 


It  is  not  necessary  to  calibrate  the  target  detector  to  the  point  where  all  non-target 
regions  are  completely  blocked  out.  Instead,  the  values  should  be  adjusted  so  that  a 
balance  between  target  area  and  background  noise  is  found.  The  parameters  are  saved  in  a 
YAML  file  and  loaded  into  the  detector  at  run  time. 

3.8.4  Data  Collection. 

Data  is  captured  using  a  ROS  tool  called  a  bag.  ROS  bags  store  serialized  data  from 
messages  as  they  are  received  [2].  The  bag  can  be  played  back  to  the  same  topics  they 
were  recorded  from.  In  addition,  the  bag  can  be  played  at  a  slower  rate,  paused,  and 
configured  from  a  launch  file.  The  utility  of  ROS  bags  allows  for  seamless  transitioning 
between  online  capture  and  post-processing  of  data. 
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3.9  Summary 

This  chapter  developed  the  software,  algorithms,  and  tools  used  in  this  thesis.  The 
first  section  outlined  the  system  architecture.  The  next  sections  developed  the  target 
tracking  algorithm,  the  relative  camera  pose  estimation  algorithm,  the  feature 
correspondence  algorithm,  and  the  feature  triangulation  algorithm  which  yield  estimates 
of  relative  camera  positions  and  feature  locations.  A  Levenburg-Marquardt  batch 
processing  algorithm  was  also  developed  to  process  relative  pose  and  feature 
correspondence  measurements  into  a  single  estimate  of  the  motion  of  both  vehicles  over 
one  time  epoch.  The  delayed  state  Kalman  filter  equations  were  then  developed  to 
estimate  vehicle  position  based  on  vanishing  point  and  batch  processor  measurements. 
The  final  section  described  procedures  for  calibration  prior  to  running  experiments.  The 
next  chapter  will  present  a  set  of  experiments  which  validate  the  operation  of  the  system 
and  its  algorithms  described  in  this  chapter. 
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IV.  Results 


In  chapter  3,  a  system  of  algorithms  were  developed  to  enable  cooperative  localization 
for  two  airborne  vehicles.  This  chapter  presents  a  set  of  experiments  used  to  validate 
the  algorithms  developed  in  this  thesis.  The  experiments  progress  in  stages,  proceeding 
from  evaluating  the  performance  of  the  individual  algorithms,  up  to  evaluating  the 
performance  of  the  entire  cooperative  binocular  stereopsis  algorithm. 

4.1  Testing  Equipment 

When  running  experiments  the  two  ARDrone  vehicles  were  mounted  on  a  rigid  cart 
assembly,  shown  in  Figure  4.1.  The  cart  ensures  that  the  targets  on  the  lead  vehicle  are 
within  view  of  the  trail  vehicle’s  camera  at  all  times.  The  wireless  router  also  rests  on  the 
cart  assembly,  so  that  both  vehicles  are  always  within  a  short  range  of  the  router. 


Figure  4.1:  Testing  Equipment.  The  equipment  shown  was  used  to  keep  the  lead  vehicle 
with  illuminated  targets  in  view  of  the  trailing  vehicle  during  tests. 
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It  is  important  to  note  that  although  the  vehicles  were  mounted,  this  information  was 
not  given  to  any  of  the  algorithms.  The  developed  algorithms  re-estimate  the  relative 
position  and  attitude  of  the  two  vehicles  for  each  image  pair  as  if  their  relative  positions 
were  changing  with  time. 

4.2  Feature  Matching  Experiment 

A  number  of  different  algorithms  for  feature  matching  are  provided  within  OpenCV. 
Specifically,  OpenCV  provides  objects  for  feature  extraction,  feature  descriptor  extraction, 
and  feature  descriptor  matching.  Each  object  has  its  own  set  of  tuning  parameters  which 
can  be  configured  in  their  construction  or  adjusted  during  run  time  [10].  It  is  unrealistic  to 
attempt  a  comparison  of  feature  extractors,  descriptors,  and  matchers  -  there  are  arguably 
many  combinations  which  could  achieve  good  performance  with  the  right  tuning.  Instead, 
an  extractor,  descriptor,  and  matcher  were  chosen  and  tuned  until  an  acceptable 
performance  was  reached. 

The  two  vehicles  were  mounted  on  the  cart  assembly  and  their  image  feeds  were 
recorded  to  two  synchronized  video  files.  The  cart  was  allowed  to  move  so  that  the  test  of 
the  feature  matching  algorithm  was  more  robust.  The  video  files  were  then  processed  to 
find  feature  matches  for  each  image  pair  in  the  video  files.  For  this  test,  matching  was 
performed  at  the  descriptor  level  -  the  distance  between  feature  descriptors  is  the  only 
criteria  for  determining  a  match. 

4.2.1  Feature  Matching  Experiment  Results. 

For  this  research,  the  FAST  [64]  feature  extractor,  BRISK  [40]  descriptor  extractor, 
and  Brute  Force  Hamming  matcher  [10]  were  used  for  feature  matching.  These  feature 
and  descriptor  extractors  were  chosen  because  of  their  low  computational  cost  and 
comparable  performance  to  SIFT  and  SURF.  To  overcome  the  fact  that  FAST  is  not 
inherently  scale-invariant,  a  Gaussian  pyramid  adapter  [10]  was  used  to  modify  the  feature 


75 


extraction  algorithm  to  allow  for  scale-space  feature  extraction.  Table  4. 1  shows  the 
values  of  each  tuning  parameter  for  the  extractor,  descriptor,  and  matcher  selected. 


Table  4.1:  Matching  Parameters. 


Threshold 

15 

FAST 

Normax  Suppression 

false 

Pyramid  Levels 

3 

Threshold 

1 

BRISK 

Octaves 

3 

Pattern  Scale 

2.2 

Brute  Force  Matcher 

Cross-check 

true 

Table  4.2  shows  the  matching  performance  achieved  for  a  sample  image  sequence 
after  tuning  the  selected  feature  matching  objects  and  Gaussian  pyramid  adapter.  The 
results  indicate  that  on  average  OpenCV’s  feature  matching  objects  will  return  enough 
putative  feature  correspondences  to  continue  with  the  feature  correspondence  algorithm 


Table  4.2:  Matching  Performance. 


Frames  Processed 

103 

Avg  Detection  Time  (ms) 

1.242 

Avg  Extraction  Time  (ms) 

2.525 

Avg  Matching  Time  (ms) 

1.533 

Avg  Features  Per  Frame 

207 

Avg  Matches  Per  Frame 

96 
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4.3  Relative  Camera  Pose  Estimation  Accuracy  Experiment 

The  goal  of  this  experiment  was  to  characterize  the  error  associated  with  estimating 
the  six  variables  defining  relative  camera  pose  (three  for  translation  and  three  for 
orientation).  In  addition,  the  results  this  test  will  be  used  to  construct  the  error  covariance 
of  the  relative  camera  position  estimate  RA.  The  vehicles  were  initially  mounted  on  the 
cart  assembly  in  the  Vicon  chamber.  Next,  the  target  areas  were  initialized  so  that  the 
target  tracking  algorithm  could  lock  on  to  the  vehicle  targets  on  the  lead  vehicle.  Once  a 
lock  was  acquired,  the  trail  vehicle  could  be  moved  around;  however,  movement  needed  to 
be  gradual  to  prevent  the  target  tracking  algorithm  from  losing  lock.  A  ROS  synchronizer 
was  used  to  match  up  relative  pose  estimates  with  Vicon  motion  capture  data  and  write  the 
information  to  a  CSV  file.  Finally,  the  data  file  was  processed  in  MATLAB  to  determine 
the  error  in  the  relative  camera  pose  estimation  algorithm. 

4.3.1  Relative  Camera  Pose  Estimation  Accuracy  Experiment  Results. 

Figures  4.2  and  4.3  show  the  distributions  for  2500  samples  of  the  error  between  the 
translation  and  rotation  estimated  by  the  image  processing  algorithm  and  the  true 
translation  and  rotation  calculated  from  Vicon  motion  capture  data. 

The  bias  in  the  errors  are  likely  due  to  errors  in  the  trail  camera  model  and  the 
measured  target  locations.  The  target  locations  were  measured  by  hand  and  are  likely  only 
accurate  within  a  few  centimeters.  The  distributions  for  the  translation  estimation  errors 
do  not  appear  to  be  Gaussian;  however,  the  distributions  are  almost  symmetrical  about 
their  means  and  their  spreads  are  relatively  small  such  that  Gaussian  approximations  can 
be  applied.  The  X  error  distribution  has  a  mean  of  -0.008  m  and  a  standard  deviation  of 
0.007  m.  The  Y  error  is  has  a  mean  of  -0.071  m  and  a  standard  deviation  of  0.005  m.  The 
Z  error  distribution  has  a  mean  of  -0.022  m  and  a  standard  deviation  of  0.004  m.  The 
measurement  error  covariance  for  relative  camera  position  estimates  is  therefore 


77 


(a)  X  Error 


(b)  Y  Error 


Error  (m) 


(c)  Z  Error 

Figure  4.2:  Distributions  of  Relative  Translation  Errors. 
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0.000049  0 


0 


Ra  =  0  0.000025  0 


(4.1) 


0 


0  0.000016 


and  the  mean  errors  can  be  subtracted  from  the  relative  translation  estimate  to  yield  an 
unbiased  estimate. 

Similar  to  the  biases  in  the  translation  estimates,  there  are  small  biases  in  the 
rotational  estimates.  These  biases  are  also  likely  due  to  camera  calibration  errors  and 
errors  in  the  measured  target  locations.  The  distributions  of  the  rotation  estimation  errors 
appear  less  well-behaved  than  the  translational  errors.  However,  their  spreads  are  also 
relatively  small  such  that  the  means  can  be  used  to  calibrate  the  rotation  estimate.  The 
X-axis  rotation  error  has  a  mean  of  -0.796°  and  a  standard  deviation  of  0.459°.  The  T-axis 
rotation  error  has  a  mean  of  -0.004°  and  a  standard  deviation  of  0.380°.  The  Z-axis 
rotation  error  has  a  mean  of  -0.336°  and  a  standard  deviation  of  0.861°.  Although  the 
relative  orientation  is  not  used  as  a  measurement,  it  is  used  with  the  translation  estimate  to 
calculate  the  fundamental  matrix  between  paired  images.  The  estimated  relative 
orientation  and  translation  are  also  used  to  triangulate  features,  as  described  in  Section 


3.5.1. 


4.4  Triangulation  Accuracy  Experiment 

The  goal  of  this  experiment  was  to  characterize  the  error  in  the  triangulation 
algorithm  presented  in  Section  3.5.1.  Features  in  the  hallway  were  hand  selected  from 
images,  as  shown  in  Figure  4.4.  The  baseline  and  relative  orientation  for  the  two  cameras 
were  estimated  using  OpenCY’s  solvePnP  function,  using  the  CV.ITERATIVE  option. 
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Error  (deg) 


(a)  X-Axis  Rotation  Error 


(b)  Y-Axis  Rotation  Error 


260 


Error  (deg) 

(c)  Z-Axis  Rotation  Error 

Figure  4.3:  Distributions  of  Relative  Orientation  Errors. 
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Figure  4.4:  Sample  Points  for  Evaluating  Triangulation  Accuracy.  Corresponding  points 
were  hand  selected  from  two  images. 


Since  points  were  sampled  in  the  hallway,  there  was  no  access  to  truth  data  from 
Vicon.  Instead,  a  laser  range-finder  was  used  to  determine  the  true  depth  to  each  target 
feature.  The  depth  to  each  sampled  feature  is  the  least  observable  parameter  in 
triangulation,  and  consequently  has  the  most  error.  The  dominating  factor  for  determining 
the  depth  of  a  feature  is  the  parallax  angle  between  both  cameras’  line  of  sight  vectors. 
Figure  4.5  illustrates  the  relationship  between  parallax  angle  and  depth  estimation  error. 

Clearly  Figure  4.5  shows  that  there  is  a  high  correlation  between  parallax  angle  and 
depth  estimation  error.  Features  correspondences  with  a  parallax  angle  less  than  0.06 
radians  may  have  up  to  3  m  of  depth  error.  This  lower  bound  on  the  parallax  angle  was 
used  in  the  image  processing  algorithm  to  determine  if  a  three  dimensional  feature  could 
be  accurately  triangulated  from  its  pair  of  image  measurements. 

One  unanticipated  product  of  this  test  was  the  indication  of  an  upper  bound  on  the 
parallax  for  real  feature  correspondences.  From  Figure  4.5,  the  maximum  achievable 
parallax  for  the  image  pair  is  nearly  0.18  radians.  This  limit  on  the  parallax  angle  can  be 
used  to  discard  erroneous  matches  which  appear  to  have  an  extremely  good  parallax 
angle.  Figure  4.6  shows  an  example  of  such  a  match. 
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Parallax  Angle  v.  Depth  Error 
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Figure  4.5:  Feature  Depth  Error  v.  Parallax  Angle.  The  error  in  depth  determined  by  the 
triangulation  algorithm  and  by  a  laser  range  finder  are  plotted  against  the  parallax  angle. 


Figure  4.6:  An  Erroneous  Match  With  Good  Parallax  Angle.  This  figure  shows  a  possible 
scenario  in  which  the  intensity  of  a  light  fixture  is  erroneously  matched  to  refelctions  on  a 
floor  tile.  The  image  coordinates  of  the  light  fixture  in  the  left  image  map  to  a  line  which 
passes  through  the  true  feature  location  and  converges  at  the  epipole  in  the  second  image. 


Typically,  the  epipolar  constraint  can  be  used  to  determine  erroneous  matches. 
However,  the  epipolar  constraint  maps  a  point  in  one  image  to  a  line  of  possible 
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corresponding  points  in  another  image.  The  mistmatched  points  shown  in  Figure  4.6  pass 
the  epipolar  constraint  and  their  computed  parallax  angle  is  0.4530  radians.  Enforcing  a 
maximum  limit  on  parallax  angle  ensures  that  matches  like  the  one  shown  in  Figure  4.6 
are  discarded. 

4.5  Batch  Processing  Experiment 

The  goal  of  the  this  test  was  to  determine  the  accuracy  of  the  batch  processor  in  the 
hallway  environment.  To  do  this,  the  two  ARDrone  vehicles  were  mounted  on  the  cart 
assembly  and  pushed  down  the  hallway  at  a  constant  velocity  and  heading.  Figure  4.7 
illustrates  the  desired  course  for  this  test.  An  important  distinction  between  the  testing  in 
this  section  and  the  testing  done  in  [18]  is  that  the  test  course  did  not  need  to  be  modified 
prior  to  testing.  The  feature  correspondence  algorithm  is  able  to  capture  enough  natural 
features  in  the  environment  to  extract  a  positional  measurement. 


12.5  m 


/ _ 

2.3  m 

V - 

Desired  Course 

Figure  4.7:  Hallway  Course  Used  for  Testing  the  Batch  Processing  Algorithm. 


Because  the  vehicles  were  mounted  to  the  cart  assembly  and  the  heading  is  nominally 
aligned  with  the  center  of  the  hallway,  it  was  assumed  that  motion  would  only  be 
observable  in  the  +X  direction.  The  average  velocity  in  the  A-direction  was  computed  by 
dividing  the  course  length  by  the  time.  The  average  change  in  position  between  frames  in 
the  X  direction,  AXavg,  could  then  be  approximated  by  the  equation 
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^Xavg  —  r 

frame  rate 


average  velocity 


(4.2) 


and  the  average  change  in  position  between  frames  for  the  other  two  directions  was 
assumed  to  be  zero.  In  addition,  each  data  file  used  in  this  test  was  clipped  to  a  region  of 
frames  in  which  the  motion  approximation  would  hold. 

4.5.1  Batch  Processing  Experiment  Results. 

The  hallway  test  was  carried  out  for  four  trials.  The  results  presented  in  this  section 
are  from  a  single  trial,  but  are  representative  of  the  performance  of  the  batch  processor  in 
each  trial.  Figure  4.8  shows  the  solution  errors  over  time,  found  by  subtracting  the  average 
motion  per  frame  from  the  solution  returned  by  the  batch  processor. 

Notice  that  the  solution  errors  in  Figure  4.8c  are  much  lower  than  the  errors  shown  in 
Figures  4.8a  and  4.8b.  Since  the  body  frame  of  the  trail  vehicle  at  time  4_i  is  the  refrence 
frame  for  the  batch  processor,  the  relative  camera  position  estimate  which  relates  the 
position  of  the  lead  vehicle  to  the  position  of  the  trail  vehicle  at  time  4-i  is  an  absolute 
measurement  of  the  position  of  the  lead  vehicle  at  time  4_i  in  this  reference  frame.  It  is 
therefore  expected,  and  indeed  verified,  that  the  solution  errors  in  Figure  4.8c  should  be  on 
the  order  of  the  level  of  accuracy  of  relative  camera  position  estimate. 

The  solution  errors  appear  to  be  zero  mean  and  are  well  bounded  by  3-cr  standard 
deviation  as  computed  by  the  batch  processor,  with  exception  to  a  few  easily  identifiable 
outliers.  This  indicates  that  the  batch  processor  is  able  to  accurately  characterize  the 
precision  in  its  solution.  It  was  found,  however,  that  solutions  in  which  the  covariance 
deviates  too  far  from  the  average  cause  the  Kalman  filter’s  estimate  to  be  thrown  off.  Each 
of  the  ( X ,  Y.Z)  triplets  in  Figure  4.8  can  be  handled  separately  from  eachother  with  the 
following  constraint 


(4.3) 
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Figure  4.8:  Batch  Processor  Solution  Errors.  The  batch  processor  solution  errors  (blue)  are 
plotted  with  3-cr  standard  deviation  lines  (red). 


which  is  essentially  the  magnitude  of  the  three-dimensional  position  uncertainty  for  each 
position  in  the  batch  solution  state.  The  value  of  /3  was  empirically  chosen  to  be  0.2  for  the 
first  triplet,  0.2  for  the  second  triplet,  and  0.01  for  the  third  triplet.  Solutions  which  exceed 
any  of  the  these  constraints  are  discarded.  Figure  4.9  shows  the  improvement  in  the 
Kalman  filter  estimate  after  applying  these  constraints.  In  both  instances  where  the  filter 
estimate  begins  to  diverge,  the  solution  errors  exceed  one  of  the  empirical  thresholds. 
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(a)  Filter  Estimate  Without  Constraints  (b)  Filter  Estimate  With  Constraints 

Figure  4.9:  Effect  of  Enforcing  Solution  Error  Constraints.  The  X-Y  trajectories  estimated 
by  the  Kalman  filter  with  (left)  and  without  (right)  constraints  on  the  batch  solution  error. 


It  it  more  pertinent  to  analyze  the  statistical  properties  of  the  measurement  which  is 
given  to  the  Kalman  filter.  Figure  4.10  shows  the  distribution  of  measurement  errors. 

Since  the  covariance  of  the  measurement  is  recomputed  for  each  iteration,  the 
measurement  error  is  divided  by  its  standard  deviation  to  normalize  the  errors  to  the  same 
distribution.  The  histogram  in  Figure  4.10  closely  follows  a  standard  zero-mean  Gaussain 
distribution.  This  is  important,  since  the  delayed  state  Kalman  filter  assumes  that  the  error 
in  the  measurements  follows  a  Gaussian  distribution. 

4.6  Kalman  Filter  Experiments 

The  goal  of  these  experiments  was  to  determine  the  accuracy  of  the  delayed  state 
Kalman  filtering  algorithm  in  phases.  First,  the  Kalman  filter  will  be  validated 
independently  from  the  batch  processor.  Then  the  Kalman  filter  will  be  validated 
including  the  batch  processor.  Due  to  an  implementation  issue,  the  Kalman  filtering 
algorithm  was  duplicated  in  MATLAB.  There  appeared  to  be  numerical  issues  in 
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0.45 


Measurement  Distribution 


Figure  4.10:  Normailized  Measurement  Error  Distribution.  This  figure  shows  a  histogram 
of  measurement  errors  (blue)  normalized  by  their  standard  deviation  and  a  trace  of  a 
standard  zero-mean  Gaussian  distribution  (red). 


computing  the  Kalman  gain  using  OpenCV’s  linear  algebra  which  caused  the  filter  to 
reject  too  many  measurements.  In  each  experiment,  data  is  first  processed  in  C++  for 
measurements,  then  imported  into  MATLAB  to  be  used  by  the  Kalman  filter  algorithm. 

4.6.1  Monte  Carlo  Simulations. 

The  purpose  of  this  test  was  to  isolate  the  Kalman  filtering  algorithm  from  the  batch 
processing  algorithm  and  determine  the  filter’s  accuracy  with  generated  measurements. 
The  flight  profile  in  Figure  4.11  was  captured  by  the  Vicon  system  and  has  milimeter  level 
accuracy. 

Simulated  measurements  were  formed  by  transforming  the  truth  data  into  a  sequence 
of  exact  (error-free)  measurements  and  then  adding  Gaussian  noise.  The  Gaussian  noise 
was  generated  from  a  distribution  with  constant  covariance  -  not  a  changing  covariance  as 
would  be  expected  if  the  filter  were  connected  to  the  batch  processor.  The  covariance  was 
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(a)  Lead  Flight  Profile 


(b)  Trail  Flight  Profile 


Figure  4.1 1:  Flight  Profile  for  Monte  Carlo  Simulations.  The  flight  profile  exhibits  forward 
and  backward  motion  in  two  directions  as  well  as  periods  of  no  motion. 


set  to  a  reasonable  estimate  of  the  average  covariance  computed  by  the  batch  processor. 
Because  the  covariance  does  not  vary  like  the  real  data,  adequate  performance  in  this  test 
is  a  necessary,  but  not  sufficient  condition  to  validate  the  Kalman  filtering  algorithm. 

4.6.2  Monte  Carlo  Simulation  Results. 

The  Monte  Carlo  simulation  consisted  of  1000  individual  trials.  The  initial  state  for 
each  trail  was  also  given  random  Gaussian  error.  Figure  4.12  shows  the  ensemble  mean  of 
the  filter  computed  trajectories,  the  true  trajectory,  the  filter  computed  standard  deviation, 
and  the  ensemble  standard  deviation  of  the  filter  computed  trajectories. 

It  is  clear  from  these  results  that  on  average  the  Kalman  filter  can  be  expected  to 
closely  follow  the  true  trajectory,  given  that  the  measurements  do  not  deviate  much  from 
the  measurement  covariance  used  to  generate  measurements  in  this  experiment.  The 
Monte  Carlo  simulation  also  reveals  a  primary  weakness  of  the  filtering  algorithm.  Over 
time,  the  uncertainty  in  the  filter’s  estimate  grows,  most  noticeably  in  the  Z  direction.  This 
is  because  measurements  correspond  to  differences  between  two  states  of  the  filter  - 
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(a)  Lead  Estimate  (b)  Trail  Estimate 

Figure  4.12:  Monte  Carlo  Kalman  Filter  Simulation  Results.  Estimates  for  the  position 
of  the  lead  (a)  and  trail  (b)  vehicles.  The  ensemble  mean  trajectory  (solid  blue)  and  filter 
computed  standard  deviation  (dashed  blue)  are  plotted  against  the  true  trajectory  (solid 
green)  and  ensemble  standard  deviation  (dashed  green) 


contributing  only  relative  information.  Because  absolute  information  is  not  brought  into 
the  filter,  the  filter  can  only  reduce  the  rate  of  error  growth.  Overcoming  this  issue  is  not 
within  the  scope  of  this  research  and  will  be  left  for  future  work. 

4.6.3  Processor  Integration  Test  Setup. 

The  purpose  of  this  test  was  to  investigate  the  performance  of  the  Kalman  filtering 
algorithm  by  using  outputs  from  the  batch  processor  and  the  corrected  body  angles.  The 
two  ARDrone  vehicles  were  mounted  on  the  cart  assembly  and  placed  at  the  starting 
position  of  the  test  course  shown  in  Figure  4.13.  For  each  sample  run,  the  cart  assembly 
was  pushed  at  a  constant  rate  and  heading  down  the  center  of  the  test  course  and  the  image 
and  navigation  data  was  saved  to  a  ROS  bag  file. 

To  process  the  bag  file,  the  image  and  navigation  data  were  converted  into  two 
synchronized  videos  and  a  CSV  file  with  time  tagged  navigation  measurements  (i.e.  the 
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Figure  4.13:  Hallway  Course  Used  for  Testing  the  Integration  of  the  batch  processor  and 
Kalman  Filter. 


reported  body  angles).  This  ensured  that  the  software  would  always  process  the  same 
information.  Next,  the  image  sequences  and  navigation  measurements  were  processed  for 
measurements  and  corrected  body  angles.  For  each  image  pair,  the  vanishing  points  were 
extracted  and  used  to  correct  the  error  in  the  reported  yaw  angle.  The  images  were  also 
processed  for  matching  features  and  the  batch  processor  was  used  to  extract  the  batch 
measurement.  The  measurement  and  body  angle  data  was  then  uploaded  into  MATLAB 
and  processed  in  the  Kalman  filtering  algorithm  to  produce  an  estimate  of  the  vehicle 
trajectories.  The  test  course  was  sampled  three  times,  each  at  a  desired  rate  of  15  Hz.  For 
each  sample,  the  first  relative  pose  estimate  was  used  to  initialize  the  Kalman  filter  state. 

4.6.4  Unfiltered  Trajectory  Estimate. 

As  a  point  of  comparison,  Figure  4.14  first  presents  the  estimated  X-Y  trajectories  for 
each  sample  based  on  taking  the  cumulative  sum  of  the  delta-position  information 
contained  in  the  measurements  (i.e.  without  filtering  the  measurements).  As  an 
intermediate  step,  outliers  were  identified  and  removed  from  the  measurements.  Without 
position  information  from  another  source,  it  is  not  possible  to  determine  the  error  in  the 
trajectory  estimate  over  time.  Instead,  Table  4.3  summarizes  the  final  RMS  position  errors. 
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(a)  Lead  Vehicle  Trajectory 


(b)  Trail  Vehicle  Trajectory 


Figure  4.14:  Trajectories  Estimated  From  Measurements  Only.  The  trajectories  calculated 
by  accumulating  raw  measurement  data  are  plotted  for  the  lead  and  trail  vehicles.  The  solid 
black  lines  indicate  the  walls  of  the  hallway. 


Table  4.3:  Final  Position  Errors  Using  Measurements  Only. 


Sample 

Fead  RMS  Error  (m) 

Trail  RMS  Error  (m) 

1 

2.807 

3.372 

2 

4.412 

4.072 

3 

3.429 

3.907 

Figure  4.14  shows  that  the  vehicle  motion  captured  by  the  measurements  is  consistent 
with  what  would  be  expected.  In  sample  three,  however,  there  appears  to  be  a  persistent 
bias  in  the  Y  direction  that  causes  the  raw  trajectories  to  drift  outside  the  hallway  limits. 
The  errors  in  the  final  position,  as  indicated  in  Table  4.3,  are  likely  due  to  the  accumulation 
of  noise  as  the  delta-position  information  is  summed.  The  Kalman  filter  is  designed  to 
handle  noise  in  the  measurements  and  should  produce  a  better  estimate  of  the  trajectories. 
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4.6.5  Processor  Integration  Results  . 

Figure  4.15  shows  the  X-Y  trajectories  computed  by  the  Kalman  filter  for  each 
sample.  The  trajectories  computed  by  the  filter  agree  with  the  motion  captured  by  the 
measurements.  In  addition,  each  trajectory  computed  by  the  filter  is  noticeably  less  noisy 
than  the  trajectory  estimated  from  the  cumulative  sum  of  raw  measurement  data.  This  is 
because  the  Kalman  gain  limits  the  amount  of  information  that  each  measurement 
contributes  to  the  trajectory  estimate.  The  filter,  however,  is  unable  to  deal  with  the 
measurement  drift  in  sample  three. 


(a)  Lead  Vehicle  Trajectory 


(b)  Trail  Vehicle  Trajectory 


Figure  4.15:  Kalman  Filter  Position  Estimate.  The  filter  computed  trajectories  are  plotted 
for  the  lead  and  trail  vehicles.  The  solid  black  lines  indicate  the  walls  of  the  hallway. 


Figure  4.16  compares  the  measured  final  position  of  each  vehicle  to  the  filter’s 
estimate,  and  the  RMS  errors  are  summarized  in  Table  4.4.  Figure  4.16  and  Table  4.4 
show  that  there  is  an  inconsistency  between  the  true  error  and  the  error  estimated  by  the 
filter  for  sample  one,  while  the  solutions  are  consistent  for  the  other  two  samples.  This 
could  be  due  to  several  factors,  which  will  be  investigated  in  the  rest  of  this  section. 
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(a)  Final  Position  Estimate  (Lead) 


(b)  Final  Position  Estimate  (Trail) 


Figure  4.16:  Final  Position  Estimates.  The  final  position  estimate  for  each  sample 
(asterisks)  are  plotted  with  3-cr  computed  X-Y  standard  deviation  ellipses. 


Table  4.4:  Final  Position  Error  Using  Kalman  Filter. 


Sample 

Lead  RMS  Error  (m) 

Trail  RMS  Error  (m) 

1 

4.806 

4.871 

2 

1.257 

1.264 

3 

1.986 

1.882 

Firstly,  it  is  important  to  inspect  the  behavior  of  the  measurement  residuals  and 
ensure  that  measurement  outliers  are  being  rejected  by  the  filter.  Ideally,  the  measurement 
residuals  should  be  normally  distributed  with  zero  mean  and  covariance  determined  by  the 
Kalman  filter.  Therefore,  an  outlier  can  be  identified  by  a  residual  value  that  is  greater 
than  3-cr.  Due  to  correlations  between  measurements,  it  was  decided  that  if  any  element  of 
the  measurement  vector  exceeded  the  residual  tolerance,  the  entire  measurement  for  that 
epoch  would  be  rejected.  The  measurement  residuals  and  measurement  errors  are  shown 
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side-by-side  in  Figure  4.17.  This  data  is  from  sample  one,  but  is  representative  of  the  other 
trials  as  well. 


0  5  10  15  20  25  30 


Sys  Time  (s) 


(a)  Measurement  Residuals  p"'(fc) 


(b)  Measurement  Errors  p}1^) 


(c)  Measurement  Residuals  pJ'  (Ar) 


(d)  Measurement  Errors  p}l'(&) 


Figure  4.17:  Kalman  Filter  Residuals  v.  Measurement  Errors. 


94 


JM  (|  l\ 

l*J— _ J  A.  jLv  „  {  aV_  jlI  __  I  u 

.1  VI  I  .  ^  I  .  .  • 


ll 

.JrL. 

*w 

T 


_ i-J  i — 

o  *>v  /^VrwV*-  v;wv>/>>V^?v'  V  « 

rr*^nixrr^yir^r(rr,xMirv“iflr”'-1r 


J 


-o.i 


ii 


in 


ti"  j|  p  ji  i|  « 

Lvl )y  i/\  j ^  ji  jjli/L'*''  \4_j^  ,j ,.•, 

ir’ '^r'K'  >-jrir  tt'v 

i _ iu! _ .  M  . _ , _ 1 _ i 


V* 

— -■—ij— ~ 


i 

A 


15  20 

Sys  Time  (s) 


(a)  Measurement  Residuals  p\l(k  -  1) 


If 


10  15  20  25  30  35 


Sys  Time  (s) 

(b)  Measurement  Errorsp}v'(£  -  1) 


Figure  4.18:  Kalman  Filter  Residuals  v.  Measurement  Errors  (Continued).  (Left)  The 
residual  values  for  each  measurement  (blue)  with  3-cr  standard  deviation  lines  (red)  as 
computed  by  the  Kalman  filter.  (Right)  The  corresponding  measurement  (blue)  with  3-cr 
standard  deviation  (red)  as  reported  by  the  batch  processor.  Rejected  measurements  are 
shown  as  green  triangles. 


By  inspection  of  Figures  4.17  and  4.18,  it  can  be  verified  that  both  the  measurement 
errors  and  measurement  residuals  are  approximately  zero  mean  and  bounded  by  their 
respective  standard  deviation  values.  In  addition,  Figures  4.17  and  4.18  show  that  the  true 
outliers  are  being  rejected  by  the  filter.  The  error  in  the  trajectory  estimate  is  therefore  not 
likely  related  to  residual  monitoring.  However,  there  also  appear  to  be  a  few  residuals 
which  are  rejected,  but  do  not  correspond  to  measurement  outliers. 

The  rejection  of  feasible  measurements  indicates  that  there  is  a  significant 
discrepancy  between  the  measurement  prediction  based  on  the  propagated  state  and  the 
realized  measurement  at  those  epochs.  This  discrepancy  may  be  related  to  the  simplified 
dynamic  model  and  the  way  measurements  update  the  Kalman  filter.  The  constant 
velocity  model  used  in  this  thesis  forces  the  mean  velocity  between  epochs  to  remain 
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constant  and  only  allows  the  uncertainty  in  the  velocities  to  grow.  The  value  of  velocity 
states  can  only  be  changed  by  an  update.  Due  to  the  measurement  model,  the  velocities 
are  not  updated  by  the  filter  directly.  Instead,  the  velocity  states  are  updated  indirectly  by 
the  Kalman  gain  matrix  K  computed  for  each  update.  Figure  4.19  shows  the  X,  Y,  and  Z 
velocities  for  both  vehicles  as  computed  by  the  Kalman  filter  for  each  sample. 
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(a)  Lead  Estimate  (b)  Trail  Estimate 

Figure  4.19:  Kalman  Filter  Velocity  Estimates.  The  velocities  estimated  by  the  filter  are 
shown  for  the  lead  (a)  and  trail  (b)  vehicles 


Figure  4.19  reveals  the  combined  effects  of  the  propagation  model  and  measurement 
model.  The  estimated  velocities  in  the  X  direction  of  sample  one  for  both  the  lead  and  trail 
vehicles  do  not  appear  to  reach  a  steady  state  as  would  be  expected,  but  rather  they 
oscillate.  This  oscillation  is  also  observable  in  the  Y  and  Z  directions  and  likely  accounts 
for  the  steady  drift  of  sample  two’s  estimated  trajectory  outside  the  walls  of  the  test  course. 
Inaccuracies  in  the  estimated  velocities  fold  into  erroneous  predictions  by  the  propagation 
model  which  results  in  feasible  measurements  being  rejected  by  residual  monitoring. 
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4.6.6  Re-Tuning  the  Kalman  Filter. 

The  results  of  the  last  section  suggest  that  the  inconsistency  in  the  Kalman  filter 
estimate  was  caused  by  inaccuracies  in  the  predicted  measurement  as  computed  by  the 
propagation  model.  However,  Figure  4.17  did  show  that  the  measurement  errors  are  well 
characterized  by  their  computed  covariance.  In  this  section,  the  Kalman  filter  parameters 
will  be  re-tuned  to  reduce  confidence  in  the  propagation  model  and  increase  confidence  in 
the  measurements.  To  reduce  the  confidence  in  the  propagation  model,  process  noise  is 
added  to  the  propagation  equations  via  the  noise  strength  matrix  for  each  vehicle  Q The 
new  noise  strength  matrix  for  each  vehicle  is 


Q 


0.5  0  0  0  0 

0  0.5  0  0  0 

0  0  0.5  0  0 

0  0  0  0.001  0 

0  0  0  0  0.1 


(4.4) 


Increasing  the  propagation  noise  also  inflates  the  residual  covariance,  which  creates 
the  potential  for  accepting  outliers.  Therefore,  the  residual  threshold  is  reduced  to  2.5 -cr. 
Once  again,  the  measurements  and  body  angles  were  processed  by  the  Kalman  filter  to 
produce  estimates  of  the  vehicle  trajectories  for  each  sample.  The  resulting  trajectories 
and  final  positions  are  shown  in  Figure  4.20  and  summarized  in  Table  4.5. 
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(c)  Trail  Vehicle  Trajectory 


(d)  Final  Position  Estimate  (Trail) 


Figure  4.20:  Trajectory  Estimates  after  Re-Tuning.  The  recomputed  trajectories  (left)  and 
final  position  estimates  (right)  are  shown  for  the  lead  and  trail  vehicles  for  each  sample. 
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Table  4.5:  Final  Position  Errors  After  Re-Tuning. 


Sample 

Lead  RMS  Error  (m) 

Trail  RMS  Error  (m) 

1 

2.503 

2.577 

2 

2.064 

2.053 

3 

2.205 

2.089 

Although  the  error  in  the  final  position  estimate  increases  for  samples  two  and  three, 
the  error  in  sample  one  becomes  consistent  with  the  covariance  calculated  by  the  Kalman 
filter.  This  consistency  does  not,  however,  resolve  the  other  issue  pointed  out  in  Section 
4.6.5  -  the  velocity  estimates.  Figure  4.21  shows  the  Kalman  filter  estimated  velocities 
after  re-tuning. 


(a)  Lead  Estimate  (b)  Trail  Estimate 

Figure  4.21:  Kalman  Filter  Velocity  Estimates  After  Re-Tuning.  The  velocities  estimated 
by  the  filter  are  shown  for  the  lead  (a)  and  trail  (b)  vehicles 
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While  more  confidence  is  given  to  the  measurements,  there  is  not  a  significant 
improvement  in  the  velocity  estimates.  In  fact,  the  velocity  estimates  may  have  become 
slightly  worse.  By  increasing  confidence  in  the  measurements  and  increasing  process 
noise,  the  filter  is  less  able  to  filter  out  measurement  noise.  Instead,  the  filter  follows  the 
measurements  as  they  jump  around  due  to  measurement  noise. 

One  possible  solution  is  to  robustify  the  propagation  model.  The  goal  of  robustifying 
the  propagation  model  would  be  to  increase  the  prediction  accuracy  of  the  propagation 
model  while  minimizing  the  required  amount  of  process  noise.  Minimizing  the  required 
amount  of  process  noise  would  reduce  the  filter’s  dependence  on  highly  accurate 
measurements  and  would  likely  result  in  better  velocity  estimates.  A  more  robust 
propagation  model  will  be  left  for  future  work  and  will  be  briefly  discussed  in  the  next 
chapter. 

4.6.7  Localization  Improvement. 

As  a  final  comparison  point,  this  section  will  show  that  the  localization  algorithm 
developed  in  this  thesis  has  a  clear  performance  improvement  over  the  system  designed  in 
[18].  Recall  that  in  [18]  the  on-board  velocity  estimation  algorithm  required  that  texture 
be  added  to  the  test  course  (in  the  form  of  tape  strips  on  the  floor)  in  order  to  return 
reasonable  velocity  estimates.  Figure  4.22  compares  the  unfiltered  trajectory  estimate 
found  by  integrating  the  on-board  velocities  with  the  unfiltered  trajectory  estimate  found 
by  taking  the  cumulative  sum  of  the  delta-position  information  contained  in  the 
measurements  from  the  batch  processor  on  the  same  test  course,  without  any 
modifications. 

According  to  Figure  4.22,  there  is  a  clear  improvement  in  the  amount  of  motion 
information  contained  in  measurements  from  the  velocity  measurements  used  in  [18]  to 
the  feature  based  measurements  developed  in  this  thesis.  The  on-board  velocities  tend  to 
underestimate  the  true  velocity  of  the  vehicle.  This  is  likely  related  to  the  optical  flow 
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Figure  4.22:  Improvements  Over  Previous  System.  The  estimated  final  positions  found 
by  integrating  on-board  velocities  are  plotted  against  the  estimated  final  positions  found 
by  taking  the  cumulative  sum  of  delta-positions  from  the  measurements  developed  in  this 
thesis. 


technique  [8]  [31]  used  by  the  on-board  velocity  estimation  algorithm  to  numerically 
calculate  the  flow  of  an  image  sequence  and  estimate  velocity.  This  technique  depends  on 
rich  texture  to  detect  image  motion  and  is  greatly  impacted  when  image  motion  is 
unobservable  [8]  (e.g.  the  image  sequence  only  captures  a  constant  surface  such  as  a  blank 
wall  or  floor).  The  unmodified  floor  pattern  in  the  test  course  often  exhibits  a  slowly 
changing  pattern  which  may  cause  the  optical  flow  algorithm  to  determine  there  is  no 
image  motion  and  cause  the  performance  degradation  observed  in  Figure  4.22.  The 
algorithms  in  this  thesis  do  not  suffer  these  same  effects  since  they  process  images  from 
the  forward  looking  cameras  which  naturally  contain  more  texture  than  the  downward 
looking  cameras. 
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4.7  Summary 

In  this  chapter,  a  series  of  experiments  were  conducted  to  investigate  the  performance 
of  the  algorithms  designed  in  the  previous  chapter.  It  was  shown  that  the  feature 
correspondence  and  triangulation  algorithms  are  capable  of  determining  a  well  localized 
set  of  matching  feature  points  between  two  synchronized  images  and  two  separate  time 
epochs.  The  relative  camera  pose  estimation  algorithm  was  shown  to  be  capable  of 
accurately  determining  the  relative  position  and  orientations  of  the  two  vehicles.  Testing 
of  the  batch  processor  showed  that  the  measurements  it  returned  were  well  characterized 
by  their  computed  covariance  and  are  approximately  Gaussian.  The  Kalman  filtering 
algorithm  was  validated  independently  from  the  batch  processor. 

After  linking  the  batch  processor  to  the  Kalman  filter,  it  was  found  that  the  error  in 
the  estimated  trajectory  computed  by  the  filter  was  not  consistent  with  the  true  error  in  the 
trajectory  estimate.  This  inconsistency  was  found  to  be  related  to  the  prediction  accuracy 
of  the  propagation  model.  Inflating  the  process  noise  in  the  propagation  model  resulted  in 
more  consistent  final  trajectory  estimates,  but  did  not  solve  the  issue  of  inaccurate  velocity 
estimation.  A  more  robust  propagation  model  is  needed  to  reckon  with  the  inaccuracies  in 
velocity  estimation. 
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V.  Conclusions 


The  goal  of  this  research  was  to  expand  on  the  system  developed  in  [18]  and 

improve  the  navigation  algorithm  accuracy  by  developing  an  alternative  method  for 
estimating  vehicle  motion  based  on  cooperative  binocular  stereopsis.  In  Chapter  3,  the 
system  developed  in  [18]  was  expanded  to  support  multiple  vehicle  communications  and 
process  the  synchronized  image  and  navigation  data  from  two  ARDrone  quadrotor 
vehicles.  The  complete  algorithm  consisted  of  multiple  child  processes  including  feature 
correspondence,  feature  triangulation,  target  tracking,  and  relative  camera  pose 
estimation,  as  well  as  vanishing  point  estimation  as  detailed  in  [18].  The  triangulated 
feature  points  and  relative  camera  positionsd  were  simultaneously  processed  by  a  batch 
algorithm  to  produce  a  single  estimate  of  the  motion  of  both  vehicles  between  two  time 
epochs.  Vanishing  point  measurements  and  the  batch  measurement  were  then  fused  in  a 
delayed  state  EKF  to  produce  an  estimate  of  the  vehicles’  trajectories. 

Experiments  were  conducted  in  phases  to  validate  each  piece  of  the  expanded 
navigation  software.  The  feature  matching  experiment  determined  an  appropriate  tuning 
for  the  OpenCV  feature  matching  objects  used  in  this  research.  The  relative  camera  pose 
estimation  accuracy  experiment  characterized  the  errors  in  estimating  relative  camera 
position  and  orientation.  The  triangulation  accuracy  experiment  characterized  the  errors  in 
the  feature  triangulation  algorithm  and  determined  a  feasible  range  of  parallax  angles  to 
accept  feature  correspondences.  The  batch  processor  experiment  confirmed  that  the  true 
solution  error  was  consistent  with  the  error  computed  by  the  batch  processor  and  that  the 
measurements  computed  by  the  batch  processor  were  approximately  Gaussian.  It  was  also 
found  that  enforcing  a  constraint  on  the  batch  solution  errors  improved  the  Kalman  filter 
estimate.  The  Kalman  filter  experiments  were  able  to  validate  the  delayed  state  EKF 
algorithm  for  the  case  of  a  constant  measurement  covariance. 
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Integrating  the  batch  processor  and  the  Kalman  filtering  algorithm  revealed  a 
limitation  of  the  propagation  model.  In  order  for  the  Kalman  filter  to  produce  a  position 
estimate  and  covariance  consistent  with  the  true  position  error,  the  process  noise  needed  to 
be  inflated  to  artificially  lower  confidence  in  the  propagation  model.  While  this  produced 
a  more  consistent  estimate,  it  did  not  result  in  any  significant  improvement  to  velocity 
estimation  which  is  important  for  autonomous  navigation. 

5.1  Further  Testing 

The  experiments  in  Chapter  4  were  able  to  validate  the  algorithms  developed  in  this 
thesis  for  the  simple  case  of  two  vehicles  traveling  down  a  single  hallway  in  a  fixed 
formation.  This  section  presents  a  number  of  additional  tests  to  further  validate  its 
operation  in  other  scenarios  and  against  other  algorithms  (e.g.  SLAM). 

5.1.1  Flight  Path  and  Formation  Control. 

In  the  previous  work  [18],  the  developed  algorithm  was  tested  on  a  course  which 
included  a  90°  turn,  which  verified  that  the  vanishing  point  algorithm  could  handle  a 
significant  change  in  heading.  The  algorithms  in  this  thesis,  however,  have  not  been  tested 
in  cases  where  the  navigated  path  includes  turns,  changes  in  altitude,  or  when  the  vehicles 
are  allowed  to  move  in  formation.  These  tests  are  needed  to  confirm  that  the  algorithms 
designed  in  this  thesis  could  be  implemented  in  an  autonomous  system. 

5.1.2  Runtime  Concerns. 

The  system  developed  in  this  thesis  also  neglects  some  important  runtime  concerns. 
For  instance,  the  system  was  not  designed  to  operate  during  a  communication  loss. 
Precautions  were  taken  to  ensure  that  communications  were  not  lost  during  testing.  The 
system  is  also  not  robust  to  the  trail  vehicle  losing  lock  on  the  lead  vehicle  and  it  does  not 
possess  an  algorithm  for  formation  control.  In  order  to  implement  these  algorithms  in  a 
real  system,  these  capabilities  must  be  developed  and  tested. 
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5.1.3  Feature  Depletion. 

The  testing  environment  used  to  validate  the  system  developed  in  this  thesis  consisted 
of  enough  features  throughout  the  course  to  enable  reliable  feature  matching.  It  is  not 
reasonable  to  assume  that  the  system  designed  in  this  thesis  could  operate  effectively  in  an 
environment  with  sparse  features  or  significant  changes  in  lighting  conditions.  Therefore, 
the  system  designed  in  this  thesis  needs  to  be  tested  in  these  environments  to  determine  its 
capability. 

5.1.4  Comparison  to  State  of  the  Art. 

The  cooperative  binocular  stereopsis  algorithm  developed  in  this  thesis  suggests  it  is 
an  improvement  over  competing  algorithms.  In  terms  of  its  performance  against  SLAM 
algorithms,  cooperative  binocular  stereopsis  may  have  an  advantage  in  computational  load 
while  maintaining  an  equivalent  level  of  localization  accuracy  as  SLAM.  In  terms  of  its 
performance  against  stereo  vision  algorithms,  cooperative  binocular  stereopsis  may  have 
an  advantage  in  localizing  features  in  the  environment  since  the  camera  baseline  is 
allowed  to  be  much  larger  than  a  binocular  camera  pair  mounted  on  a  single  vehicle.  In 
order  to  make  a  real  comparison  of  cooperative  binocular  stereopsis  with  similar 
algorithms,  the  image  and  navigational  data  collected  during  experiments  should  be 
processed  by  these  similar  algorithms  to  determine  if  an  advantage  exists. 

5.2  Future  Work 

There  are  a  number  of  improvements  that  could  be  investigated  in  follow-on  work. 
These  improvements  include  more  efficient  algorithm  implementations  and  changes  to  the 
mathematical  properties  of  the  algorithms,  such  as  absolute  positon  measurements  and  an 
improved  dynamic  model. 

5.2.1  Constrained  Feature  Matching. 

One  of  the  primary  weaknesses  in  the  feature  matching  algorithm  designed  in  this 
thesis  is  the  separation  of  the  feature  matching  algorithm  from  the  Kalman  filter  estimate. 
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The  Kalman  filter  provides  a  prediction  of  the  motion  of  both  vehicles,  along  with  a  level 
of  confidence  in  that  estimate  -  the  covariance.  This  motion  could  be  used  to  predict  a 
region  of  pixels  in  another  image  where  the  feature  match  is  likely  to  be. 

5.2.2  Absolute  Position  Measurements. 

A  limitation  of  the  algorithms  developed  in  this  thesis,  and  of  the  various  SLAM 
algorithms  found  in  literature,  is  that  the  position  uncertainty  will  eventually  diverge.  This 
is  because  no  absolute  information  is  given  to  the  filter.  Absolute  measurements  manifest 
in  image-aided  navigation  as  landmarks,  which  are  known  features  in  the  environment.  As 
Veth  points  out  [66],  the  main  difference  between  relative  measurements  and  absolute 
measurements  is  the  independence  of  the  landmark  position  and  and  vechicle  position. 

For  absolute  measurements,  the  landmark  position  and  vehicle  position  are  uncorrelated, 
such  that  the  reduction  in  position  uncertainty  is  greater  than  for  relative  measurements. 

5.2.3  Computational  Improvements. 

The  software  developed  for  this  thesis  was  not  designed  to  optimize  computational 
load.  Computational  optimization,  although  not  specifically  investigated  in  this  thesis,  is 
important  in  a  real  system  with  limited  computational  resources.  The  following 
subsections  outline  two  of  the  most  obvious  areas  that  could  use  improvement. 

5.2.3. 1  Sparse  Linear  Algebra. 

The  Jacobians  computed  in  the  measurement  processor  and  in  the  Kalman  filter  have 
an  obvoius  sparse  structure.  There  are  many  techniques  in  the  linear  algebra  literature  for 
solving  sparse  systems  of  equations  using  Schur  decomposition  or  graph  based 
techniques.  As  an  example,  Hartley  and  Zisserman  give  an  example  of  a  partitioned 
Levenburg-Marquardt  algorithm  which  takes  advantage  of  the  sparse  nature  of  the  bundle 
adjustment  problem  [28].  A  similar  approach  could  be  taken  within  the  measurement 
processing  algorithm  to  partition  the  state  of  the  measurement  processor  into  vehicle 
positions  and  feature  positions. 
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Because  of  the  high  accuracy  of  relative  pose  measurements,  a  simpler  option  would 
be  to  convert  the  Levenburg-Marquardt  algorithm  in  Section  3.6  into  a  linearly 
constrained  least  squares  problem,  where  the  relative  pose  measurements  form  the 
constraints  on  the  solution  space.  Since  the  only  other  measurements  in  the  batch 
algorithm  are  feature  correspondences,  the  problem  becomes  a  classical  linearly 
constrained  bundle  adjustment  problem. 

5.2.3. 2  Parallelization. 

The  image  processing  functions  and  matrix  calculations  throughout  the  algorithms  in 
this  thesis  could  have  been  massively  parallelized.  A  graphics  processing  unit  (GPU)  is  a 
piece  of  hardware  which  is  capable  of  performing  many  calculations  in  parallel.  It  is 
especially  useful  for  image  data,  but  can  also  be  used  to  solve  linear  algebra  equations, 
especially  those  with  sparsity.  OpenCV  supports  the  creation  and  manipulation  of 
matrices  using  a  GPU  and  has  a  GPU  optimized  library  specifically  for  feature  matching. 

5.2.4  Improved  Dynamic  Model. 

The  simplistic  dynamic  model  used  in  this  thesis  proved  to  be  an  obstacle  for 
accurate  position  estimation.  A  more  appropriate  dynamic  model  would  include  inputs  in 
the  dynamic  model,  leading  to  more  accurate  measurement  predictions  without  inflating 
process  noise.  One  source  of  dynamic  model  inputs  are  the  raw  inertial  measurements. 
The  measurement  equations  could  be  redefined  to  correspond  to  the  bias  terms  of  an  IMU. 
Unfortunately,  as  mentioned  in  Chapter  3,  the  raw  inertial  measurements  are  not 
accessible.  Thus,  this  approach  would  require  a  different  platform. 

Another  approach  would  be  to  use  the  reported  body  angles  to  estimate  the  velocities 
of  each  vehicle.  There  is  an  approximately  proportional  relationship  between  pitch  angle 
and  forward  velocity.  The  same  relationship  exists  between  yaw  and  sideways  velocity. 

By  using  the  corrected  heading  angle,  the  velocities  of  each  vehicle  in  the  navigation 
frame  could  be  estimated  and  used  to  propagate  the  system  to  the  next  update  time. 
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5.3  Final  Conclusion 


This  research  developed  a  cooperative  localization  algorithm  based  on  two- view 
geometry  which  overcomes  weaknesses  in  previous  work  and  is  able  to  estimate  the 
trajectory  of  two  independent  vehicles  from  tracking  feature  points  over  only  two  time 
epochs.  With  further  testing,  this  algorithm  has  the  potential  as  an  alternative  to  treating 
feature  correspondences  in  a  SLAM  network.  This  research  also  provided  another 
example  of  the  ARDrone’s  utility  as  a  platform  for  vision-aided  navigation  research. 
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